merge

83242886 · Jack · 72079991 · ce63a4f5 · 83242886 · 83242886
36 changed file
--- a/README.md
+++ b/README.md
-简体中文 | [English](README_en.md)
+English | [简体中文](README_cn.md)
-文档：[https://paddledetection.readthedocs.io](https://paddledetection.readthedocs.io)
+Documentation: [https://paddledetection.readthedocs.io](https://paddledetection.readthedocs.io)
 # PaddleDetection
-飞桨推出的PaddleDetection是端到端目标检测开发套件，旨在帮助开发者更快更好地完成检测模型的训练、精度速度优化到部署全流程。PaddleDetection以模块化的设计实现了多种主流目标检测算法，并且提供了丰富的数据增强、网络组件、损失函数等模块，集成了模型压缩和跨平台高性能部署能力。目前基于PaddleDetection已经完成落地的项目涉及工业质检、遥感图像检测、无人巡检等多个领域。
+PaddleDetection is an end-to-end object detection development kit based on PaddlePaddle. It
+aims to help developers through the whole workflow of training models, optimizing accuracy and
+inference speed, and deploying models. PaddleDetection implements a variety of object detection architectures
+in a modular design and provides a rich set of data augmentation methods, network components, loss functions, etc.
+With practical features such as model compression and multi-platform deployment, PaddleDetection has
+supported practical projects such as industrial quality inspection, remote sensing
+image object detection, and automatic inspection.
-**目前检测库下模型均要求使用PaddlePaddle 1.7及以上版本或适当的develop版本。**
+[PP-YOLO](https://arxiv.org/abs/2007.12099), which is faster and more accurate than YOLOv4,
+has been released. It reaches 45.2% mAP(0.5:0.95) on the COCO test2019 dataset and 72.9 FPS on a single
+Tesla V100. Please refer to [PP-YOLO](configs/ppyolo/README.md) for details.
+**All models in PaddleDetection now require PaddlePaddle 1.8 or higher, or a suitable develop version.**
 <div align="center">
  <img src="docs/images/000000570688.jpg" />
 </div>
-## 简介
+## Introduction
-特性：
- 模型丰富：
+Features:
-  PaddleDetection提供了丰富的模型，包含目标检测、实例分割、人脸检测等100+个预训练模型，涵盖多种数据集竞赛冠军方案、适合云端/边缘端设备部署的检测方案。
+- Rich models:
- 易部署:
+  PaddleDetection provides a rich set of models, including 100+ pre-trained models
+for object detection, instance segmentation, face detection, etc. It covers
+competition champion models and practical detection models for cloud and edge devices.
-  PaddleDetection的模型中使用的核心算子均通过C++或CUDA实现，同时基于PaddlePaddle的高性能推理引擎可以方便地部署在多种硬件平台上。
+- Production Ready:
- 高灵活度：
+  Key operations are implemented in C++ and CUDA. Together with PaddlePaddle's
+highly efficient inference engine, this enables easy deployment in server environments.
-  PaddleDetection通过模块化设计来解耦各个组件，基于配置文件可以轻松地搭建各种检测模型。
+- Highly Flexible:
- 高性能：
+  Components are designed to be modular. Model architectures, as well as data
+preprocessing pipelines, can be easily customized with simple configuration
+changes.
-  基于PaddlePaddle框架的高性能内核，在模型训练速度、显存占用上有一定的优势。例如，YOLOv3的训练速度快于其他框架，在Tesla V100 16GB环境下，Mask-RCNN(ResNet50)可以单卡Batch Size可以达到4 (甚至到5)。
+- Performance Optimized:
+  With the help of the underlying PaddlePaddle framework, faster training and
+a reduced GPU memory footprint are achieved. Notably, YOLOv3 training is
+much faster than in other frameworks. As another example, with Mask-RCNN
+(ResNet50) we managed to fit up to 4 images per GPU (Tesla V100 16GB) during
+multi-GPU training.
-支持的模型结构：
+Supported Architectures:
-|                    | ResNet | ResNet-vd <sup>[1](#vd)</sup> | ResNeXt-vd | SENet | MobileNet |  HRNet | Res2Net |
+|                     | ResNet | ResNet-vd <sup>[1](#vd)</sup> | ResNeXt-vd | SENet | MobileNet |  HRNet | Res2Net |
-|--------------------|:------:|------------------------------:|:----------:|:-----:|:---------:|:------:| :--:    |
+| ------------------- | :----: | ----------------------------: | :--------: | :---: | :-------: |:------:|:-----:  |
-| Faster R-CNN       | ✓      |                             ✓ | x          | ✓     | ✗         |  ✗     |  ✗      |
+| Faster R-CNN        |   ✓    |                             ✓ |     ✗      |   ✓   |     ✗     |   ✗    |  ✗      |
-| Faster R-CNN + FPN | ✓      |                             ✓ | ✓          | ✓     | ✗         |  ✓     |  ✓      |
+| Faster R-CNN + FPN  |   ✓    |                             ✓ |     ✓      |   ✓   |     ✗     |   ✓    |  ✓      |
-| Mask R-CNN         | ✓      |                             ✓ | x          | ✓     | ✗         |  ✗     |  ✗      |
+| Mask R-CNN          |   ✓    |                             ✓ |     ✗      |   ✓   |     ✗     |   ✗    |  ✗      |
-| Mask R-CNN + FPN   | ✓      |                             ✓ | ✓          | ✓     | ✗         |  ✗     |  ✓      |
+| Mask R-CNN + FPN    |   ✓    |                             ✓ |     ✓      |   ✓   |     ✗     |   ✗    |  ✓      |
-| Cascade Faster-RCNN | ✓     |                             ✓ | ✓          | ✗     | ✗         |  ✗     |  ✗      |
+| Cascade Faster-RCNN |   ✓    |                             ✓ |     ✓      |   ✗   |     ✗     |   ✗    |  ✗      |
-| Cascade Mask-RCNN  | ✓      |                             ✗ | ✗          | ✓     | ✗         |  ✗     |  ✗      |
+| Cascade Mask-RCNN   |   ✓    |                             ✗ |     ✗      |   ✓   |     ✗     |   ✗    |  ✗      |
-| Libra R-CNN        | ✗      |                             ✓ | ✗          | ✗     | ✗         |  ✗     |  ✗      |
+| Libra R-CNN         |   ✗    |                             ✓ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-| RetinaNet          | ✓      |                             ✗ | ✓          | ✗     | ✗         |  ✗     |  ✗      |
+| RetinaNet           |   ✓    |                             ✗ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-| YOLOv3             | ✓      |                             ✓ | ✗          | ✗     | ✓         |  ✗     |  ✗      |
+| YOLOv3              |   ✓    |                             ✓ |     ✗      |   ✗   |     ✓     |   ✗    |  ✗      |
-| SSD                | ✗      |                             ✗ | ✗          | ✗     | ✓         |  ✗     |  ✗      |
+| SSD                 |   ✗    |                             ✗ |     ✗      |   ✗   |     ✓     |   ✗    |  ✗      |
-| BlazeFace          | ✗      |                             ✗ | ✗          | ✗     | ✗         |  ✗     |  ✗      |
+| BlazeFace           |   ✗    |                             ✗ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-| Faceboxes          | ✗      |                             ✗ | ✗          | ✗     | ✗         |  ✗     |  ✗      |
+| Faceboxes           |   ✗    |                             ✗ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-<a name="vd">[1]</a> [ResNet-vd](https://arxiv.org/pdf/1812.01187) 模型预测速度基本不变的情况下提高了精度。
+<a name="vd">[1]</a> [ResNet-vd](https://arxiv.org/pdf/1812.01187) models offer much improved accuracy with negligible performance cost.
-**说明：** ✓ 为[模型库](docs/MODEL_ZOO_cn.md)中提供了对应配置文件和预训练模型，✗ 为未提供参考配置，但一般都支持。
+**NOTE:** ✓ means a config file and pretrained model are provided in the [Model Zoo](docs/MODEL_ZOO.md); ✗ means they are not provided, but the combination is generally supported.
-更多的模型:
+More models:
 - EfficientDet
 - FCOS
 - CornerNet-Squeeze
 - YOLOv4
+- PP-YOLO
-更多的Backone：
+More Backbones:
 - DarkNet
 - VGG
 - GCNet
 - CBNet
- Hourglass
-扩展特性：
+Advanced Features:
 - [x] **Synchronized Batch Norm**
 - [x] **Group Norm**
 - [x] **Modulated Deformable Convolution**
 - [x] **Deformable PSRoI Pooling**
- [x] **Non-local和GCNet**
+- [x] **Non-local and GCNet**
+**NOTE:** Synchronized batch normalization can only be used with multiple GPUs; it cannot be used on CPU or on a single GPU.
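
The restriction in the note above can be expressed as a small guard. This is a minimal sketch, not PaddleDetection code: the helper `choose_norm_type` and the string values `"sync_bn"` and `"bn"` are hypothetical names chosen for illustration.

```python
def choose_norm_type(requested, num_gpus):
    """Return a norm type usable on the given number of GPUs (hypothetical helper).

    Synchronized batch norm needs at least two GPUs; on CPU (0 GPUs) or a
    single GPU this falls back to ordinary batch norm.
    """
    if requested == "sync_bn" and num_gpus < 2:
        return "bn"
    return requested


print(choose_norm_type("sync_bn", num_gpus=8))  # multi-GPU: sync BN is kept
print(choose_norm_type("sync_bn", num_gpus=1))  # single GPU: falls back to BN
print(choose_norm_type("sync_bn", num_gpus=0))  # CPU only: falls back to BN
```

Falling back silently is one possible design; raising an error instead would make the misconfiguration more visible.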
+The following figure shows COCO mAP versus FPS on Tesla V100 for representative models of each architecture and backbone.
+<div align="center">
+  <img src="docs/images/map_fps.png" width=800 />
+</div>
-**注意:** Synchronized batch normalization 只能在多GPU环境下使用，不能在CPU环境或者单GPU环境下使用。
+**NOTE:**
+- `CBResNet` stands for `Cascade-Faster-RCNN-CBResNet200vd-FPN`, which has the highest COCO mAP, 53.3%, among PaddleDetection models
+- `Cascade-Faster-RCNN` stands for `Cascade-Faster-RCNN-ResNet50vd-DCN`, which has been optimized to 20 FPS inference speed at a COCO mAP of 47.8%
+- The enhanced `YOLOv3-ResNet50vd-DCN` is 10.6 absolute percentage points higher in COCO mAP than the original paper, and its inference speed is nearly 70% faster than the darknet framework
+- All these models are available in the [Model Zoo](#Model-Zoo)
-以下为选取各模型结构和骨干网络的代表模型COCO数据集精度mAP和单卡Tesla V100上预测速度(FPS)关系图。
+The following figure shows COCO mAP versus FPS on Tesla V100 for state-of-the-art object detectors and PP-YOLO. PP-YOLO is faster and more accurate than YOLOv4, reaching 45.2% mAP(0.5:0.95) on the COCO test2019 dataset and 72.9 FPS on a single Tesla V100. Please refer to [PP-YOLO](configs/ppyolo/README.md) for details.
 <div align="center">
-  <img src="docs/images/map_fps.png" />
+  <img src="docs/images/ppyolo_map_fps.png" width=600 />
 </div>
-**说明：**
+## Tutorials
- `CBResNet`为`Cascade-Faster-RCNN-CBResNet200vd-FPN`模型，COCO数据集mAP高达53.3%
- `Cascade-Faster-RCNN`为`Cascade-Faster-RCNN-ResNet50vd-DCN`，PaddleDetection将其优化到COCO数据mAP为47.8%时推理速度为20FPS
- PaddleDetection增强版`YOLOv3-ResNet50vd-DCN`在COCO数据集mAP高于原作10.6个绝对百分点，推理速度为61.3FPS，快于原作约70%
+### Get Started
- 图中模型均可在[模型库](#模型库)中获取
+- [Installation guide](docs/tutorials/INSTALL.md)
-## 文档教程
+- [Quick start on small dataset](docs/tutorials/QUICK_STARTED.md)
+- [Train/Evaluation/Inference](docs/tutorials/GETTING_STARTED.md)
-### 入门教程
+- [How to train a custom dataset](docs/tutorials/Custom_DataSet.md)
+- [FAQ](docs/FAQ.md)
- [安装说明](docs/tutorials/INSTALL_cn.md)
- [快速开始](docs/tutorials/QUICK_STARTED_cn.md)
+### Advanced Tutorial
- [训练/评估/预测流程](docs/tutorials/GETTING_STARTED_cn.md)
- [如何训练自定义数据集](docs/tutorials/Custom_DataSet.md)
+- [Guide to preprocess pipeline and dataset definition](docs/advanced_tutorials/READER.md)
- [常见问题汇总](docs/FAQ.md)
+- [How to build a model](docs/advanced_tutorials/MODEL_TECHNICAL.md)
+- [Transfer learning document](docs/advanced_tutorials/TRANSFER_LEARNING.md)
-### 进阶教程
+- [Parameter configuration](docs/advanced_tutorials/config_doc):
- [数据预处理及数据集定义](docs/advanced_tutorials/READER.md)
+  - [Introduction to the configuration workflow](docs/advanced_tutorials/config_doc/CONFIG.md)
- [搭建模型步骤](docs/advanced_tutorials/MODEL_TECHNICAL.md)
+  - [Parameter configuration for RCNN model](docs/advanced_tutorials/config_doc/RCNN_PARAMS_DOC.md)
- [模型参数配置](docs/advanced_tutorials/config_doc):
-  - [配置模块设计和介绍](docs/advanced_tutorials/config_doc/CONFIG_cn.md)
-  - [RCNN模型参数说明](docs/advanced_tutorials/config_doc/RCNN_PARAMS_DOC.md)
- [迁移学习教程](docs/advanced_tutorials/TRANSFER_LEARNING_cn.md)
 - [IPython Notebook demo](demo/mask_rcnn_demo.ipynb)
- [模型压缩](slim)
+- [Model compression](slim)
-    - [压缩benchmark](slim)
+    - [Model compression benchmark](slim)
-    - [量化](slim/quantization)
+    - [Quantization](slim/quantization)
-    - [剪枝](slim/prune)
+    - [Model pruning](slim/prune)
-    - [蒸馏](slim/distillation)
+    - [Model distillation](slim/distillation)
-    - [神经网络搜索](slim/nas)
+    - [Neural Architecture Search](slim/nas)
- [推理部署](deploy)
+- [Deployment](deploy)
-    - [模型导出教程](docs/advanced_tutorials/deploy/EXPORT_MODEL.md)
+    - [Export model for inference](docs/advanced_tutorials/deploy/EXPORT_MODEL.md)
-    - [Python端推理部署](deploy/python)
+    - [Python inference](deploy/python)
-    - [C++端推理部署](deploy/cpp)
+    - [C++ inference](deploy/cpp)
-    - [推理Benchmark](docs/advanced_tutorials/deploy/BENCHMARK_INFER_cn.md)
+    - [Inference benchmark](docs/advanced_tutorials/deploy/BENCHMARK_INFER_cn.md)
-## 模型库
+## Model Zoo
- [模型库](docs/MODEL_ZOO_cn.md)
+- Pretrained models are available in the [PaddleDetection model zoo](docs/MODEL_ZOO.md).
- [移动端模型](configs/mobile/README.md)
+- [Mobile models](configs/mobile/README.md)
- [Anchor free模型](configs/anchor_free/README.md)
+- [Anchor free models](configs/anchor_free/README.md)
- [人脸检测模型](docs/featured_model/FACE_DETECTION.md)
+- [Face detection models](docs/featured_model/FACE_DETECTION_en.md)
- [YOLOv3增强模型](docs/featured_model/YOLOv3_ENHANCEMENT.md): COCO mAP高达43.6%，原论文精度为33.0%
+- [Pretrained models for pedestrian detection](docs/featured_model/CONTRIB.md)
- [行人检测预训练模型](docs/featured_model/CONTRIB_cn.md)
+- [Pretrained models for vehicle detection](docs/featured_model/CONTRIB.md)
- [车辆检测预训练模型](docs/featured_model/CONTRIB_cn.md)
+- [YOLOv3 enhanced model](docs/featured_model/YOLOv3_ENHANCEMENT.md): Compared to the mAP of 33.0% in the paper, the enhanced YOLOv3 reaches an mAP of 43.6%, and inference speed is improved as well
- [Objects365 2019 Challenge夺冠模型](docs/featured_model/champion_model/CACascadeRCNN.md)
+- [PP-YOLO](configs/ppyolo/README.md): PP-YOLO reaches 45.3% mAP on the COCO dataset and 72.9 FPS on a single Tesla V100
- [Open Images 2019-Object Detction比赛最佳单模型](docs/featured_model/champion_model/OIDV5_BASELINE_MODEL.md)
+- [Objects365 2019 Challenge champion model](docs/featured_model/champion_model/CACascadeRCNN.md)
- [服务器端实用目标检测模型](configs/rcnn_enhance/README.md): V100上速度20FPS时，COCO mAP高达47.8%。
+- [Best single model of Open Images 2019-Object Detection](docs/featured_model/champion_model/OIDV5_BASELINE_MODEL.md)
- [大规模实用目标检测模型](docs/featured_model/LARGE_SCALE_DET_MODEL.md): 提供了包含676个类别的大规模服务器端实用目标检测模型，适用于绝大部分使用场景，可以直接用来预测，也可以用于微调其他任务。
+- [Practical server-side detection method](configs/rcnn_enhance/README_en.md): Inference speed on a single V100 GPU reaches 20 FPS at a COCO mAP of 47.8%.
+- [Large-scale practical object detection models](docs/featured_model/LARGE_SCALE_DET_MODEL_en.md): Large-scale practical server-side pretrained detection models with 676 categories are provided for most application scenarios. They can be used for direct inference or for fine-tuning on other datasets.
-## 许可证书
-本项目的发布受[Apache 2.0 license](LICENSE)许可认证。
+## License
+PaddleDetection is released under the [Apache 2.0 license](LICENSE).
-## 版本更新
-v0.3.0版本已经在`05/2020`发布，增加Anchor-free、EfficientDet和YOLOv4等多个模型，推出移动端、服务器端实用高效多个模型，例如移动端将YOLOv3-MobileNetv3加速3.5倍，服务器端优化两阶段模型，速度和精度具备较高性价比。重构预测部署功能，提升易用性，修复已知诸多bug等，详细内容请参考[版本更新文档](docs/CHANGELOG.md)。
+## Updates
+v0.4.0 was released in `07/2020`. It adds PP-YOLO, TTFNet, HTC, ACFPN, etc., a BlazeFace face landmark detection model, a series of optimized SSDLite models for mobile, the GridMask and RandomErasing data augmentations, and Matrix NMS and EMA training. It also improves ease of use and fixes many known bugs.
-## 如何贡献代码
+Please refer to the [change log](docs/CHANGELOG.md) for details.
-我们非常欢迎你可以为PaddleDetection提供代码，也十分感谢你的反馈。
+## Contributing
+Contributions are highly welcome, and we would really appreciate your feedback!
--- a/README_cn.md
+++ b/README_cn.md
+简体中文 | [English](README.md)
+文档：[https://paddledetection.readthedocs.io](https://paddledetection.readthedocs.io)
+# PaddleDetection
+飞桨推出的PaddleDetection是端到端目标检测开发套件，旨在帮助开发者更快更好地完成检测模型的训练、精度速度优化到部署全流程。PaddleDetection以模块化的设计实现了多种主流目标检测算法，并且提供了丰富的数据增强、网络组件、损失函数等模块，集成了模型压缩和跨平台高性能部署能力。目前基于PaddleDetection已经完成落地的项目涉及工业质检、遥感图像检测、无人巡检等多个领域。
+PaddleDetection新发布精度速度领先的[PP-YOLO](https://arxiv.org/abs/2007.12099)模型，COCO数据集精度达到45.2%，单卡Tesla V100预测速度达到72.9 FPS，详细信息见[PP-YOLO模型](configs/ppyolo/README_cn.md)
+**目前检测库下模型均要求使用PaddlePaddle 1.8及以上版本或适当的develop版本。**
+<div align="center">
+  <img src="docs/images/000000570688.jpg" />
+</div>
+## 简介
+特性：
+- 模型丰富：
+  PaddleDetection提供了丰富的模型，包含目标检测、实例分割、人脸检测等100+个预训练模型，涵盖多种数据集竞赛冠军方案、适合云端/边缘端设备部署的检测方案。
+- 易部署:
+  PaddleDetection的模型中使用的核心算子均通过C++或CUDA实现，同时基于PaddlePaddle的高性能推理引擎可以方便地部署在多种硬件平台上。
+- 高灵活度：
+  PaddleDetection通过模块化设计来解耦各个组件，基于配置文件可以轻松地搭建各种检测模型。
+- 高性能：
+  基于PaddlePaddle框架的高性能内核，在模型训练速度、显存占用上有一定的优势。例如，YOLOv3的训练速度快于其他框架，在Tesla V100 16GB环境下，Mask-RCNN(ResNet50)可以单卡Batch Size可以达到4 (甚至到5)。
+支持的模型结构：
+|                    | ResNet | ResNet-vd <sup>[1](#vd)</sup> | ResNeXt-vd | SENet | MobileNet |  HRNet | Res2Net |
+|--------------------|:------:|------------------------------:|:----------:|:-----:|:---------:|:------:| :--:    |
+| Faster R-CNN       | ✓      |                             ✓ | x          | ✓     | ✗         |  ✗     |  ✗      |
+| Faster R-CNN + FPN | ✓      |                             ✓ | ✓          | ✓     | ✗         |  ✓     |  ✓      |
+| Mask R-CNN         | ✓      |                             ✓ | x          | ✓     | ✗         |  ✗     |  ✗      |
+| Mask R-CNN + FPN   | ✓      |                             ✓ | ✓          | ✓     | ✗         |  ✗     |  ✓      |
+| Cascade Faster-RCNN | ✓     |                             ✓ | ✓          | ✗     | ✗         |  ✗     |  ✗      |
+| Cascade Mask-RCNN  | ✓      |                             ✗ | ✗          | ✓     | ✗         |  ✗     |  ✗      |
+| Libra R-CNN        | ✗      |                             ✓ | ✗          | ✗     | ✗         |  ✗     |  ✗      |
+| RetinaNet          | ✓      |                             ✗ | ✓          | ✗     | ✗         |  ✗     |  ✗      |
+| YOLOv3             | ✓      |                             ✓ | ✗          | ✗     | ✓         |  ✗     |  ✗      |
+| SSD                | ✗      |                             ✗ | ✗          | ✗     | ✓         |  ✗     |  ✗      |
+| BlazeFace          | ✗      |                             ✗ | ✗          | ✗     | ✗         |  ✗     |  ✗      |
+| Faceboxes          | ✗      |                             ✗ | ✗          | ✗     | ✗         |  ✗     |  ✗      |
+<a name="vd">[1]</a> [ResNet-vd](https://arxiv.org/pdf/1812.01187) 模型预测速度基本不变的情况下提高了精度。
+**说明：** ✓ 为[模型库](docs/MODEL_ZOO_cn.md)中提供了对应配置文件和预训练模型，✗ 为未提供参考配置，但一般都支持。
+更多的模型:
+- EfficientDet
+- FCOS
+- CornerNet-Squeeze
+- YOLOv4
+- PP-YOLO
+更多的Backbone：
+- DarkNet
+- VGG
+- GCNet
+- CBNet
+- Hourglass
+扩展特性：
+- [x] **Synchronized Batch Norm**
+- [x] **Group Norm**
+- [x] **Modulated Deformable Convolution**
+- [x] **Deformable PSRoI Pooling**
+- [x] **Non-local和GCNet**
+**注意:** Synchronized batch normalization 只能在多GPU环境下使用，不能在CPU环境或者单GPU环境下使用。
+以下为选取各模型结构和骨干网络的代表模型COCO数据集精度mAP和单卡Tesla V100上预测速度(FPS)关系图。
+<div align="center">
+  <img src="docs/images/map_fps.png" width=800 />
+</div>
+**说明：**
+- `CBResNet`为`Cascade-Faster-RCNN-CBResNet200vd-FPN`模型，COCO数据集mAP高达53.3%
+- `Cascade-Faster-RCNN`为`Cascade-Faster-RCNN-ResNet50vd-DCN`，PaddleDetection将其优化到COCO数据mAP为47.8%时推理速度为20FPS
+- PaddleDetection增强版`YOLOv3-ResNet50vd-DCN`在COCO数据集mAP高于原作10.6个绝对百分点，推理速度为61.3FPS，快于原作约70%
+- 图中模型均可在[模型库](#模型库)中获取
+以下为PaddleDetection发布的精度和预测速度优于YOLOv4模型的PP-YOLO与前沿目标检测算法的COCO数据集精度与单卡Tesla V100预测速度(FPS)关系图， PP-YOLO模型在[COCO](http://cocodataset.org) test2019数据集上精度达到45.2%，在单卡V100上FP32推理速度为72.9 FPS，详细信息见[PP-YOLO模型](configs/ppyolo/README_cn.md)
+<div align="center">
+  <img src="docs/images/ppyolo_map_fps.png" width=600 />
+</div>
+## 文档教程
+### 入门教程
+- [安装说明](docs/tutorials/INSTALL_cn.md)
+- [快速开始](docs/tutorials/QUICK_STARTED_cn.md)
+- [训练/评估/预测流程](docs/tutorials/GETTING_STARTED_cn.md)
+- [如何训练自定义数据集](docs/tutorials/Custom_DataSet.md)
+- [常见问题汇总](docs/FAQ.md)
+### 进阶教程
+- [数据预处理及数据集定义](docs/advanced_tutorials/READER.md)
+- [搭建模型步骤](docs/advanced_tutorials/MODEL_TECHNICAL.md)
+- [模型参数配置](docs/advanced_tutorials/config_doc):
+  - [配置模块设计和介绍](docs/advanced_tutorials/config_doc/CONFIG_cn.md)
+  - [RCNN模型参数说明](docs/advanced_tutorials/config_doc/RCNN_PARAMS_DOC.md)
+- [迁移学习教程](docs/advanced_tutorials/TRANSFER_LEARNING_cn.md)
+- [IPython Notebook demo](demo/mask_rcnn_demo.ipynb)
+- [模型压缩](slim)
+    - [压缩benchmark](slim)
+    - [量化](slim/quantization)
+    - [剪枝](slim/prune)
+    - [蒸馏](slim/distillation)
+    - [神经网络搜索](slim/nas)
+- [推理部署](deploy)
+    - [模型导出教程](docs/advanced_tutorials/deploy/EXPORT_MODEL.md)
+    - [Python端推理部署](deploy/python)
+    - [C++端推理部署](deploy/cpp)
+    - [推理Benchmark](docs/advanced_tutorials/deploy/BENCHMARK_INFER_cn.md)
+## 模型库
+- [模型库](docs/MODEL_ZOO_cn.md)
+- [移动端模型](configs/mobile/README.md)
+- [Anchor free模型](configs/anchor_free/README.md)
+- [人脸检测模型](docs/featured_model/FACE_DETECTION.md)
+- [YOLOv3增强模型](docs/featured_model/YOLOv3_ENHANCEMENT.md): COCO mAP高达43.6%，原论文精度为33.0%
+- [PP-YOLO模型](configs/ppyolo/README_cn.md): COCO mAP高达45.3%，单卡Tesla V100预测速度高达72.9 FPS
+- [行人检测预训练模型](docs/featured_model/CONTRIB_cn.md)
+- [车辆检测预训练模型](docs/featured_model/CONTRIB_cn.md)
+- [Objects365 2019 Challenge夺冠模型](docs/featured_model/champion_model/CACascadeRCNN.md)
+- [Open Images 2019-Object Detection比赛最佳单模型](docs/featured_model/champion_model/OIDV5_BASELINE_MODEL.md)
+- [服务器端实用目标检测模型](configs/rcnn_enhance/README.md): V100上速度20FPS时，COCO mAP高达47.8%。
+- [大规模实用目标检测模型](docs/featured_model/LARGE_SCALE_DET_MODEL.md): 提供了包含676个类别的大规模服务器端实用目标检测模型，适用于绝大部分使用场景，可以直接用来预测，也可以用于微调其他任务。
+## 许可证书
+本项目的发布受[Apache 2.0 license](LICENSE)许可认证。
+## 版本更新
+v0.4.0版本已经在`07/2020`发布，增加PP-YOLO, TTFNet, HTC, ACFPN等多个模型，新增BlazeFace人脸关键点检测模型，新增移动端SSDLite系列优化模型，新增GridMask，RandomErasing数据增强方法，新增Matrix NMS和EMA训练，提升易用性，修复已知诸多bug等，详细内容请参考[版本更新文档](docs/CHANGELOG.md)。
+## 如何贡献代码
+我们非常欢迎你可以为PaddleDetection提供代码，也十分感谢你的反馈。
--- a/README_en.md
+++ b/README_en.md
-English | [简体中文](README.md)
-Documentation:[https://paddledetection.readthedocs.io](https://paddledetection.readthedocs.io)
-# PaddleDetection
-PaddleDetection is an end-to-end object detection development kit based on PaddlePaddle, which
-aims to help developers in the whole development of training models, optimizing performance and
-inference speed, and deploying models. PaddleDetection provides varied object detection architectures
-in modular design, and wealthy data augmentation methods, network components, loss functions, etc.
-PaddleDetection supported practical projects such as industrial quality inspection, remote sensing
-image object detection, and automatic inspection with its practical features such as model compression
-and multi-platform deployment.
-**Now all models in PaddleDetection require PaddlePaddle version 1.7 or higher, or suitable develop version.**
-<div align="center">
-  <img src="docs/images/000000570688.jpg" />
-</div>
-## Introduction
-Features:
- Rich models:
-  PaddleDetection provides rich of models, including 100+ pre-trained models
-such as object detection, instance segmentation, face detection etc. It covers
-the champion models, the practical detection models for cloud and edge device.
- Production Ready:
-  Key operations are implemented in C++ and CUDA, together with PaddlePaddle's
-highly efficient inference engine, enables easy deployment in server environments.
- Highly Flexible:
-  Components are designed to be modular. Model architectures, as well as data
-preprocess pipelines, can be easily customized with simple configuration
-changes.
- Performance Optimized:
-  With the help of the underlying PaddlePaddle framework, faster training and
-reduced GPU memory footprint is achieved. Notably, YOLOv3 training is
-much faster compared to other frameworks. Another example is Mask-RCNN
-(ResNet50), we managed to fit up to 4 images per GPU (Tesla V100 16GB) during
-multi-GPU training.
-Supported Architectures:
-|                     | ResNet | ResNet-vd <sup>[1](#vd)</sup> | ResNeXt-vd | SENet | MobileNet |  HRNet | Res2Net |
-| ------------------- | :----: | ----------------------------: | :--------: | :---: | :-------: |:------:|:-----:  |
-| Faster R-CNN        |   ✓    |                             ✓ |     x      |   ✓   |     ✗     |   ✗    |  ✗      |
-| Faster R-CNN + FPN  |   ✓    |                             ✓ |     ✓      |   ✓   |     ✗     |   ✓    |  ✓      |
-| Mask R-CNN          |   ✓    |                             ✓ |     x      |   ✓   |     ✗     |   ✗    |  ✗      |
-| Mask R-CNN + FPN    |   ✓    |                             ✓ |     ✓      |   ✓   |     ✗     |   ✗    |  ✓      |
-| Cascade Faster-RCNN |   ✓    |                             ✓ |     ✓      |   ✗   |     ✗     |   ✗    |  ✗      |
-| Cascade Mask-RCNN   |   ✓    |                             ✗ |     ✗      |   ✓   |     ✗     |   ✗    |  ✗      |
-| Libra R-CNN         |   ✗    |                             ✓ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-| RetinaNet           |   ✓    |                             ✗ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-| YOLOv3              |   ✓    |                             ✓ |     ✗      |   ✗   |     ✓     |   ✗    |  ✗      |
-| SSD                 |   ✗    |                             ✗ |     ✗      |   ✗   |     ✓     |   ✗    |  ✗      |
-| BlazeFace           |   ✗    |                             ✗ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-| Faceboxes           |   ✗    |                             ✗ |     ✗      |   ✗   |     ✗     |   ✗    |  ✗      |
-<a name="vd">[1]</a> [ResNet-vd](https://arxiv.org/pdf/1812.01187) models offer much improved accuracy with negligible performance cost.
-**NOTE:** ✓ for config file and pretrain model provided in [Model Zoo](docs/MODEL_ZOO.md), ✗ for not provided but is supported generally.
-More models:
- EfficientDet
- FCOS
- CornerNet-Squeeze
- YOLOv4
-More Backbones:
- DarkNet
- VGG
- GCNet
- CBNet
-Advanced Features:
- [x] **Synchronized Batch Norm**
- [x] **Group Norm**
- [x] **Modulated Deformable Convolution**
- [x] **Deformable PSRoI Pooling**
- [x] **Non-local and GCNet**
-**NOTE:** Synchronized batch normalization can only be used on multiple GPU devices, can not be used on CPU devices or single GPU device.
-The following is the relationship between COCO mAP and FPS on Tesla V100 of representative models of each architectures and backbones.
-<div align="center">
-  <img src="docs/images/map_fps.png" />
-</div>
-**NOTE:**
- `CBResNet` stands for `Cascade-Faster-RCNN-CBResNet200vd-FPN`, which has highest mAP on COCO as 53.3% in PaddleDetection models
- `Cascade-Faster-RCNN` stands for `Cascade-Faster-RCNN-ResNet50vd-DCN`, which has been optimized to 20 FPS inference speed when COCO mAP as 47.8%
- The enhanced `YOLOv3-ResNet50vd-DCN` is 10.6 absolute percentage points higher than paper on COCO mAP, and inference speed is nearly 70% faster than the darknet framework
- All these models can be get in [Model Zoo](#Model-Zoo)
-## Tutorials
-### Get Started
- [Installation guide](docs/tutorials/INSTALL.md)
- [Quick start on small dataset](docs/tutorials/QUICK_STARTED.md)
- [Train/Evaluation/Inference](docs/tutorials/GETTING_STARTED.md)
- [How to train a custom dataset](docs/tutorials/Custom_DataSet.md)
- [FAQ](docs/FAQ.md)
-### Advanced Tutorial
- [Guide to preprocess pipeline and dataset definition](docs/advanced_tutorials/READER.md)
- [Models technical](docs/advanced_tutorials/MODEL_TECHNICAL.md)
- [Transfer learning document](docs/advanced_tutorials/TRANSFER_LEARNING.md)
- [Parameter configuration](docs/advanced_tutorials/config_doc):
-  - [Introduction to the configuration workflow](docs/advanced_tutorials/config_doc/CONFIG.md)
-  - [Parameter configuration for RCNN model](docs/advanced_tutorials/config_doc/RCNN_PARAMS_DOC.md)
- [IPython Notebook demo](demo/mask_rcnn_demo.ipynb)
- [Model compression](slim)
-    - [Model compression benchmark](slim)
-    - [Quantization](slim/quantization)
-    - [Model pruning](slim/prune)
-    - [Model distillation](slim/distillation)
-    - [Neural Architecture Search](slim/nas)
- [Deployment](deploy)
-    - [Export model for inference](docs/advanced_tutorials/deploy/EXPORT_MODEL.md)
-    - [Python inference](deploy/python)
-    - [C++ inference](deploy/cpp)
-    - [Inference benchmark](docs/advanced_tutorials/inference/BENCHMARK_INFER_cn.md)
-## Model Zoo
- Pretrained models are available in the [PaddleDetection model zoo](docs/MODEL_ZOO.md).
- [Mobile models](configs/mobile/README.md)
- [Anchor free models](configs/anchor_free/README.md)
- [Face detection models](docs/featured_model/FACE_DETECTION_en.md)
- [Pretrained models for pedestrian detection](docs/featured_model/CONTRIB.md)
- [Pretrained models for vehicle detection](docs/featured_model/CONTRIB.md)
- [YOLOv3 enhanced model](docs/featured_model/YOLOv3_ENHANCEMENT.md): Compared to MAP of 33.0% in paper, enhanced YOLOv3 reaches the MAP of 43.6%, and inference speed is improved as well
- [Objects365 2019 Challenge champion model](docs/featured_model/champion_model/CACascadeRCNN.md)
- [Best single model of Open Images 2019-Object Detction](docs/featured_model/champion_model/OIDV5_BASELINE_MODEL.md)
- [Practical Server-side detection method](configs/rcnn_enhance/README_en.md): Inference speed on single V100 GPU can reach 20FPS when COCO mAP is 47.8%.
- [Large-scale practical object detection models](docs/featured_model/LARGE_SCALE_DET_MODEL_en.md): Large-scale practical server-side detection pretrained models with 676 categories are provided for most application scenarios, which can be used not only for direct inference but also finetuning on other datasets.
-## License
-PaddleDetection is released under the [Apache 2.0 license](LICENSE).
-## Updates
-v0.3.0 was released at `05/2020`, add anchor-free, EfficientDet, YOLOv4, etc. Launched mobile and server-side practical and efficient multiple models. For example, the YOLOv3-MobileNetv3 mobile side model is accelerated 3.5 times, the server side has optimized the two-stage model, and the speed and accuracy have high cost performance. We also refactored predictive deployment functions, and improved ease of use, fix many known bugs, etc.
-Please refer to [版本更新文档](docs/CHANGELOG.md) for details.
-## Contributing
-Contributions are highly welcomed and we would really appreciate your feedback!!
--- a/configs/ppyolo/README.md
+++ b/configs/ppyolo/README.md
+English | [简体中文](README_cn.md)
+# PP-YOLO
+## Table of Contents
+- [Introduction](#Introduction)
+- [Model Zoo](#Model_Zoo)
+- [Getting Start](#Getting_Start)
+- [Future Work](#Future_Work)
+- [Appendix](#Appendix)
+## Introduction
+[PP-YOLO](https://arxiv.org/abs/2007.12099) is an optimized model based on YOLOv3 in PaddleDetection, whose accuracy (mAP on COCO) and inference speed are better than those of [YOLOv4](https://arxiv.org/abs/2004.10934). PaddlePaddle 1.8.4 (to be released in mid-August 2020) or the [Daily Version](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-dev) is required to run PP-YOLO.
+PP-YOLO reaches 45.9% mAP(IoU=0.5:0.95) on the COCO test-dev2017 dataset. Its FP32 inference speed on a single V100 is 72.9 FPS, and its FP16 inference speed with TensorRT on a single V100 is 155.6 FPS.
+<div align="center">
+  <img src="../../docs/images/ppyolo_map_fps.png" width=500 />
+</div>
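
To put the throughput figures above in more familiar terms, the sketch below converts FPS to per-image latency and computes the FP16 speedup over FP32. It assumes the FPS numbers were measured one image at a time, so that latency is simply the reciprocal of throughput; the function name `latency_ms` is made up for this example.

```python
def latency_ms(fps):
    """Convert throughput (frames per second) to per-image latency in milliseconds."""
    return 1000.0 / fps


fp32_fps = 72.9   # PP-YOLO, FP32, single V100 (figure quoted above)
fp16_fps = 155.6  # PP-YOLO, FP16 with TensorRT, single V100 (figure quoted above)

speedup = fp16_fps / fp32_fps

print(f"FP32:            {latency_ms(fp32_fps):.1f} ms/image")
print(f"FP16 + TensorRT: {latency_ms(fp16_fps):.1f} ms/image")
print(f"Speedup:         {speedup:.2f}x")
```

Under that assumption, FP16 with TensorRT roughly halves the per-image latency compared with FP32.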
+PP-YOLO improves the accuracy and speed of YOLOv3 with the following methods:
+- Better backbone: ResNet50vd-DCN
+- Larger training batch size: 8 GPUs with a mini-batch size of 24 on each GPU
+- [Drop Block](https://arxiv.org/abs/1810.12890)
+- [Exponential Moving Average](https://www.investopedia.com/terms/e/ema.asp)
+- [IoU Loss](https://arxiv.org/pdf/1902.09630.pdf)
+- [Grid Sensitive](https://arxiv.org/abs/2004.10934)
+- [Matrix NMS](https://arxiv.org/pdf/2003.10152.pdf)
+- [CoordConv](https://arxiv.org/abs/1807.03247)
+- [Spatial Pyramid Pooling](https://arxiv.org/abs/1406.4729)
+- Better ImageNet pretrained weights
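
Of the methods listed above, the exponential moving average of weights is the simplest to illustrate. The sketch below is not PaddleDetection's implementation: the class name `WeightEMA`, the default `decay` value, and the use of plain Python floats in place of framework tensors are all assumptions made for illustration.

```python
class WeightEMA:
    """Keeps an exponential moving average of named weights (hypothetical helper)."""

    def __init__(self, weights, decay=0.999):
        self.decay = decay
        # The "shadow" copy starts from the initial weights.
        self.shadow = dict(weights)

    def update(self, weights):
        """Blend the current weights into the shadow copy after a training step."""
        d = self.decay
        for name, value in weights.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * value
        return self.shadow


# Toy run: one scalar "weight" that jumps from 0.0 to 1.0 and stays there.
ema = WeightEMA({"w": 0.0}, decay=0.5)
ema.update({"w": 1.0})
ema.update({"w": 1.0})
print(ema.shadow["w"])  # the average moves toward 1.0 gradually
```

In practice the averaged (shadow) weights are typically the ones used for evaluation, while training continues to update the raw weights.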
+## Model Zoo
+### PP-YOLO
+|          Model           | GPU number | images/GPU |  backbone  | input shape | Box AP<sup>test</sup> | V100 FP32(FPS) | V100 TensorRT FP16(FPS) | download | config  |
+|:------------------------:|:----------:|:----------:|:----------:| :----------:| :-------------------: | :------------: | :---------------------: | :------: | :-----: |
+| YOLOv4(AlexeyAB)         |     -      |      -     | CSPDarknet |     608     |         43.5          |       62       |          105.5          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml)                   |
+| YOLOv4(AlexeyAB)         |     -      |      -     | CSPDarknet |     512     |         43.0          |       83       |          138.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml)                   |
+| YOLOv4(AlexeyAB)         |     -      |      -     | CSPDarknet |     416     |         41.2          |       96       |          164.0          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml)                   |
+| YOLOv4(AlexeyAB)         |     -      |      -     | CSPDarknet |     320     |         38.0          |      123       |          199.0          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml)                   |
+| PP-YOLO                  |     8      |     24    | ResNet50vd  |     608     |         45.2          |      72.9      |          155.6          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml)                   |
+| PP-YOLO                  |     8      |     24    | ResNet50vd  |     512     |         44.4          |      89.9      |          188.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml)                   |
+| PP-YOLO                  |     8      |     24    | ResNet50vd  |     416     |         42.5          |     109.1      |          215.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml)                   |
+| PP-YOLO                  |     8      |     24    | ResNet50vd  |     320     |         39.3          |     132.2      |          242.2          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml)                   |
+| PP-YOLO_2x               |     8      |     24    | ResNet50vd  |     608     |         45.9          |      72.9      |          155.6          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml)                   |
+| PP-YOLO_2x               |     8      |     24    | ResNet50vd  |     512     |         45.0          |      89.9      |          188.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml)                   |
+| PP-YOLO_2x               |     8      |     24    | ResNet50vd  |     416     |         43.2          |     109.1      |          215.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml)                   |
+| PP-YOLO_2x               |     8      |     24    | ResNet50vd  |     320     |         40.1          |     132.2      |          242.2          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml)                   |
+**Notes:**
+- PP-YOLO is trained on the COCO train2017 dataset and evaluated on the test-dev2017 dataset. Box AP<sup>test</sup> is the `mAP(IoU=0.5:0.95)` evaluation result.
+- PP-YOLO was trained on 8 GPUs with a mini-batch size of 24 per GPU. If the number of GPUs or the mini-batch size changes, adjust the learning rate and the number of iterations according to the [FAQ](../../docs/FAQ.md).
+- PP-YOLO inference speed is tested on a single Tesla V100 with batch size 1, CUDA 10.2 and CUDNN 7.5.1; TensorRT mode uses TensorRT 5.1.2.2.
+- PP-YOLO FP32 inference speed is measured on the inference model exported by `tools/export_model.py` and benchmarked by running `deploy/python/infer.py` with `--run_benchmark`. None of the results include the time spent on data reading and post-processing (NMS), which matches the test method of [YOLOv4(AlexeyAB)](https://github.com/AlexeyAB/darknet).
+- TensorRT FP16 inference speed additionally excludes the time spent on bounding-box decoding (`yolo_box`) compared with the FP32 test above, so data reading, bounding-box decoding and post-processing (NMS) are all excluded. This also matches the test method of [YOLOv4(AlexeyAB)](https://github.com/AlexeyAB/darknet).
+- YOLOv4(AlexeyAB) accuracy and FP32 inference speed are copied from the single Tesla V100 results in the [YOLOv4 GitHub repo](https://github.com/AlexeyAB/darknet). The Tesla V100 TensorRT FP16 inference speed was tested with the tkDNN configuration and TensorRT 5.1.2.2 on a single Tesla V100, based on the [AlexeyAB/darknet repo](https://github.com/AlexeyAB/darknet).
+- The download and config links in the YOLOv4(AlexeyAB) rows point to the YOLOv4 model reproduced in PaddleDetection, whose evaluation accuracy matches YOLOv4(AlexeyAB). Fine-tuning is currently supported in PaddleDetection; reproducing the accuracy by training from backbone pretrained weights is in progress. See [PaddleDetection YOLOv4](../yolov4/README.md) for details.
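The notes above say that the learning rate and iteration count should be adjusted when the GPU count or mini-batch size changes. The sketch below illustrates one common way to do that, the linear scaling rule. This rule is an assumption here, not something the notes confirm; the FAQ remains the authoritative reference. The function name `scale_schedule` is hypothetical, and the reference values (8 GPUs, 24 images per GPU, `base_lr: 0.01`, `max_iters: 250000`) come from `configs/ppyolo/ppyolo.yml`.

```python
def scale_schedule(num_gpus, images_per_gpu,
                   ref_batch=8 * 24, ref_lr=0.01, ref_iters=250000):
    """Scale lr linearly with total batch size and keep the number of
    training samples seen roughly constant (linear scaling assumption)."""
    total_batch = num_gpus * images_per_gpu
    ratio = total_batch / ref_batch
    scaled_lr = ref_lr * ratio
    scaled_iters = int(round(ref_iters / ratio))
    return scaled_lr, scaled_iters


# Example: half the reference batch size (4 GPUs x 24 images).
lr, iters = scale_schedule(4, 24)
print(lr, iters)
```

Under this assumption, milestones and warmup steps in the `LearningRate` section would be rescaled by the same ratio as `max_iters`.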
+### PP-YOLO tiny
+|          Model           | GPU number | images/GPU |  backbone  | input shape | Box AP50<sup>val</sup> | V100 FP32(FPS) | V100 TensorRT FP16(FPS) | download | config  |
+|:------------------------:|:----------:|:----------:|:----------:| :----------:| :--------------------: | :------------: | :---------------------: | :------: | :-----: |
+| PP-YOLO tiny             |     4      |      32    | ResNet18vd |     416     |          47.0          |     401.6      |          724.6          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_tiny.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_tiny.yml)                   |
+| PP-YOLO tiny             |     4      |      32    | ResNet18vd |     320     |          43.7          |     478.5      |          791.3          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_tiny.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_tiny.yml)                   |
+- PP-YOLO tiny is trained on the COCO train2017 dataset and evaluated on the val2017 dataset. Box AP50<sup>val</sup> is the `mAP(IoU=0.5)` evaluation result.
+- PP-YOLO tiny was trained on 4 GPUs with a mini-batch size of 32 per GPU. If the number of GPUs or the mini-batch size changes, adjust the learning rate and the number of iterations according to the [FAQ](../../docs/FAQ.md).
+- The inference speed test environment and configuration for PP-YOLO tiny are the same as for PP-YOLO above.
+## Getting Started
+### 1. Training
+Train PP-YOLO on 8 GPUs with the following command (all commands are assumed to run from the PaddleDetection root directory). Use `--eval` to alternate evaluation with training.
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tools/train.py -c configs/ppyolo/ppyolo.yml --eval
+```
+### 2. Evaluation
+Evaluate PP-YOLO on the COCO val2017 dataset on a single GPU with the following commands:
+```bash
+# use weights released in PaddleDetection model zoo
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+# use saved checkpoint in training
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo.yml -o weights=output/ppyolo/best_model
+```
+To evaluate on the COCO test-dev2017 dataset, use `configs/ppyolo/ppyolo_test.yml`. Download the COCO test-dev2017 dataset from [COCO dataset download](https://cocodataset.org/#download), decompress it to the paths configured by `EvalReader.dataset` in `configs/ppyolo/ppyolo_test.yml`, and run the evaluation with the following commands:
+```bash
+# use weights released in PaddleDetection model zoo
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo_test.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+# use saved checkpoint in training
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo_test.yml -o weights=output/ppyolo/best_model
+```
+Evaluation results are saved in `bbox.json`. Compress it into a `zip` package and upload it to [COCO dataset evaluation](https://competitions.codalab.org/competitions/20794#participate) for scoring.
+**NOTE:** `configs/ppyolo/ppyolo_test.yml` is only for evaluation on the COCO test-dev2017 dataset; it cannot be used for training or for evaluation on COCO val2017.
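The packaging step described above (compressing `bbox.json` into a `zip` package) could be scripted as follows. This is a small sketch using only the Python standard library; the helper name `pack_results` and the output name `bbox.zip` are hypothetical, and any file-naming rules imposed by the evaluation server are not covered here.

```python
import zipfile
from pathlib import Path


def pack_results(json_path="bbox.json", zip_path="bbox.zip"):
    """Compress the evaluation result file into a zip archive.

    The archive stores only the file name, not its directory, so the
    JSON file sits at the top level of the package.
    """
    json_path = Path(json_path)
    if not json_path.is_file():
        raise FileNotFoundError(f"result file not found: {json_path}")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(json_path, arcname=json_path.name)
    return Path(zip_path)
```

Calling `pack_results()` from the directory that contains `bbox.json` produces `bbox.zip` next to it.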
+### 3. Inference
+Run inference on a single GPU with the following commands. Use `--infer_img` to run inference on a single image, or `--infer_dir` to run inference on all images in a directory.
+```bash
+# inference single image
+CUDA_VISIBLE_DEVICES=0 python tools/infer.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --infer_img=demo/000000014439_640x640.jpg
+# inference all images in the directory
+CUDA_VISIBLE_DEVICES=0 python tools/infer.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --infer_dir=demo
+```
+### 4. Inference deployment and benchmark
+For inference deployment or benchmarking, export the model with `tools/export_model.py` and run inference with the Paddle inference library using the following commands:
+```bash
+# export the model; it is saved in output/ppyolo by default
+python tools/export_model.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+# inference with Paddle Inference library
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True
+```
+The PP-YOLO benchmark measures the model without data reading and post-processing (NMS). Export the model with `--exclude_nms` to prune NMS from the model before benchmarking, using the following commands:
+```bash
+# export the model; --exclude_nms prunes the NMS part; the model is saved in output/ppyolo by default
+python tools/export_model.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --exclude_nms
+# FP32 benchmark
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True --run_benchmark=True
+# TensorRT FP16 benchmark
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True --run_benchmark=True --run_mode=trt_fp16
+```
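To make the measurement convention above concrete (timing only the model call, with data reading and post-processing left outside the timed region), here is a generic timing sketch. It is not the code in `deploy/python/infer.py`; `measure_fps`, `fps_to_latency_ms` and the `run_model` callable are hypothetical stand-ins, and on a real GPU the device would also need to be synchronized before each clock read.

```python
import time


def measure_fps(run_model, inputs, warmup=10, repeats=100):
    """Time only the forward call; inputs are prepared before timing."""
    for _ in range(warmup):          # warm-up runs are not timed
        run_model(inputs)
    start = time.perf_counter()
    for _ in range(repeats):
        run_model(inputs)
    elapsed = time.perf_counter() - start
    return repeats / elapsed


def fps_to_latency_ms(fps):
    """Convert frames per second to per-image latency in milliseconds."""
    return 1000.0 / fps


# Dummy workload standing in for a model forward pass.
fps = measure_fps(lambda x: sum(x), list(range(1000)), warmup=2, repeats=20)
print(f"{fps:.1f} FPS, {fps_to_latency_ms(fps):.4f} ms per call")
```

For reference, the 72.9 FPS reported for PP-YOLO at input size 608 corresponds to roughly 13.7 ms per image.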
+## Future work
+1. More PP-YOLO tiny models
+2. PP-YOLO models with more backbones
+## Appendix
+Optimization methods and ablation results for PP-YOLO compared with YOLOv3.
+| NO.  |        Model                 | Box AP<sup>val</sup> | Box AP<sup>test</sup> | Params(M) | FLOPs(G) | V100 FP32 FPS |
+| :--: | :--------------------------- | :------------------: |:--------------------: | :-------: | :------: | :-----------: |
+|  A   | YOLOv3-DarkNet53             |         38.9         |           -           |   59.13   |  65.52   |      58.2     |
+|  B   | YOLOv3-ResNet50vd-DCN        |         39.1         |           -           |   43.89   |  44.71   |      79.2     |
+|  C   | B + LB + EMA + DropBlock     |         41.4         |           -           |   43.89   |  44.71   |      79.2     |
+|  D   | C + IoU Loss                 |         41.9         |           -           |   43.89   |  44.71   |      79.2     |
+|  E   | D + IoU Aware                |         42.5         |           -           |   43.90   |  44.71   |      74.9     |
+|  F   | E + Grid Sensitive           |         42.8         |           -           |   43.90   |  44.71   |      74.8     |
+|  G   | F + Matrix NMS               |         43.5         |           -           |   43.90   |  44.71   |      74.8     |
+|  H   | G + CoordConv                |         44.0         |           -           |   43.93   |  44.76   |      74.1     |
+|  I   | H + SPP                      |         44.3         |         45.2          |   44.93   |  45.12   |      72.9     |
+|  J   | I + Better ImageNet Pretrain |         44.6         |         45.2          |   44.93   |  45.12   |      72.9     |
+**Notes:**
+- Accuracy and inference speed are measured with an input shape of 608.
+- All models are trained on the COCO train2017 dataset and evaluated on the val2017 and test-dev2017 datasets. `Box AP` is the `mAP(IoU=0.5:0.95)` evaluation result.
+- Inference speed is tested on a single Tesla V100 with batch size 1, following the benchmark test method and environment configuration described above.
+- [YOLOv3-DarkNet53](../yolov3_darknet.yml) with an mAP of 38.9 is the optimized YOLOv3 model in PaddleDetection; see [Model Zoo](../../docs/MODEL_ZOO.md) for details.
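The per-step contribution of each optimization can be read off the ablation table by differencing consecutive Box AP<sup>val</sup> values. The short sketch below does this; the numbers are copied from the table, while the names `ABLATION` and `step_gains` are illustrative.

```python
# (row label, model description, Box AP on val2017), copied from the table.
ABLATION = [
    ("A", "YOLOv3-DarkNet53", 38.9),
    ("B", "YOLOv3-ResNet50vd-DCN", 39.1),
    ("C", "B + LB + EMA + DropBlock", 41.4),
    ("D", "C + IoU Loss", 41.9),
    ("E", "D + IoU Aware", 42.5),
    ("F", "E + Grid Sensitive", 42.8),
    ("G", "F + Matrix NMS", 43.5),
    ("H", "G + CoordConv", 44.0),
    ("I", "H + SPP", 44.3),
    ("J", "I + Better ImageNet Pretrain", 44.6),
]


def step_gains(rows):
    """Return (label, description, AP gain over the previous row)."""
    gains = []
    for (_, _, prev_ap), (label, name, ap) in zip(rows, rows[1:]):
        gains.append((label, name, round(ap - prev_ap, 1)))
    return gains


gains = step_gains(ABLATION)
for label, name, gain in gains:
    print(f"{label}: {name:<30} +{gain:.1f} AP")
```

The largest single step is row C, and the gains sum to the overall improvement from row A to row J.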
--- a/configs/ppyolo/README_cn.md
+++ b/configs/ppyolo/README_cn.md
+简体中文 | [English](README.md)
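+# PP-YOLO Model
+## Contents
+- [Introduction](#introduction)
+- [Model Zoo](#model-zoo)
+- [Getting Started](#getting-started)
+- [Future Work](#future-work)
+- [Appendix](#appendix)
+## Introduction
+[PP-YOLO](https://arxiv.org/abs/2007.12099) is PaddleDetection's optimized and improved version of YOLOv3. Its accuracy (mAP on the COCO dataset) and inference speed both exceed those of [YOLOv4](https://arxiv.org/abs/2004.10934). It requires PaddlePaddle 1.8.4 (released in mid-August 2020) or a suitable [develop version](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/Tables.html#whl-dev).
+PP-YOLO reaches an accuracy of 45.9% on the [COCO](http://cocodataset.org) test-dev2017 dataset. Its FP32 inference speed on a single V100 is 72.9 FPS, and its FP16 inference speed with TensorRT enabled on a V100 is 155.6 FPS.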
+<div align="center">
+  <img src="../../docs/images/ppyolo_map_fps.png" width=500 />
+</div>
+PP-YOLO improves the accuracy and speed of YOLOv3 in the following ways:
+- Better backbone: ResNet50vd-DCN
+- Larger training batch size: 8 GPUs with batch_size=24 per GPU, with the learning rate and number of iterations adjusted accordingly
+- [Drop Block](https://arxiv.org/abs/1810.12890)
+- [Exponential Moving Average](https://www.investopedia.com/terms/e/ema.asp)
+- [IoU Loss](https://arxiv.org/pdf/1902.09630.pdf)
+- [Grid Sensitive](https://arxiv.org/abs/2004.10934)
+- [Matrix NMS](https://arxiv.org/pdf/2003.10152.pdf)
+- [CoordConv](https://arxiv.org/abs/1807.03247)
+- [Spatial Pyramid Pooling](https://arxiv.org/abs/1406.4729)
+- Better pretrained model
+## Model Zoo
+### PP-YOLO
+|      Model       | GPU number | images/GPU |  backbone  | input shape | Box AP<sup>test</sup> | V100 FP32(FPS) | V100 TensorRT FP16(FPS) | download | config |
+|:----------------:|:----------:|:----------:|:----------:|:-----------:|:---------------------:|:--------------:|:-----------------------:|:--------:|:------:|
+| YOLOv4(AlexeyAB) |     -      |     -      | CSPDarknet |     608     |         43.5          |       62       |          105.5          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| YOLOv4(AlexeyAB) |     -      |     -      | CSPDarknet |     512     |         43.0          |       83       |          138.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| YOLOv4(AlexeyAB) |     -      |     -      | CSPDarknet |     416     |         41.2          |       96       |          164.0          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| YOLOv4(AlexeyAB) |     -      |     -      | CSPDarknet |     320     |         38.0          |      123       |          199.0          | [model](https://paddlemodels.bj.bcebos.com/object_detection/yolov4_cspdarknet.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/yolov4/yolov4_csdarknet.yml) |
+| PP-YOLO          |     8      |     24     | ResNet50vd |     608     |         45.2          |      72.9      |          155.6          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PP-YOLO          |     8      |     24     | ResNet50vd |     512     |         44.4          |      89.9      |          188.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PP-YOLO          |     8      |     24     | ResNet50vd |     416     |         42.5          |     109.1      |          215.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PP-YOLO          |     8      |     24     | ResNet50vd |     320     |         39.3          |     132.2      |          242.2          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo.yml) |
+| PP-YOLO_2x       |     8      |     24     | ResNet50vd |     608     |         45.9          |      72.9      |          155.6          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml) |
+| PP-YOLO_2x       |     8      |     24     | ResNet50vd |     512     |         45.0          |      89.9      |          188.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml) |
+| PP-YOLO_2x       |     8      |     24     | ResNet50vd |     416     |         43.2          |     109.1      |          215.4          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml) |
+| PP-YOLO_2x       |     8      |     24     | ResNet50vd |     320     |         40.1          |     132.2      |          242.2          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_2x.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_2x.yml) |
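+**Notes:**
+- PP-YOLO is trained on the COCO train2017 dataset and evaluated on the test-dev2017 dataset. Box AP<sup>test</sup> is the `mAP(IoU=0.5:0.95)` evaluation result.
+- PP-YOLO is trained on 8 GPUs with a batch size of 24 per GPU. If the number of GPUs or the batch size differs from this configuration, adjust the learning rate and the number of iterations according to the [FAQ](../../docs/FAQ.md).
+- PP-YOLO inference speed is tested on a single V100 with batch size=1, using CUDA 10.2 and CUDNN 7.5.1; the TensorRT inference speed test uses TensorRT 5.1.2.2.
+- PP-YOLO FP32 inference speed is the benchmark result obtained by exporting the model with `tools/export_model.py` and then running the Paddle inference library through the `--run_benchmark` option of `deploy/python/infer.py`. All results exclude data preprocessing and post-processing of the model output (NMS), consistent with the test method of [YOLOv4(AlexeyAB)](https://github.com/AlexeyAB/darknet).
+- Compared with FP32, the TensorRT FP16 speed test additionally excludes the time spent in `yolo_box` (bbox decoding), so it excludes data preprocessing, bbox decoding and NMS (consistent with the test method of [YOLOv4(AlexeyAB)](https://github.com/AlexeyAB/darknet)).
+- YOLOv4(AlexeyAB) accuracy and V100 FP32 inference speed are taken from the single-V100 results published in the [YOLOv4 GitHub repo](https://github.com/AlexeyAB/darknet). The V100 TensorRT FP16 inference speed is the result of testing the tkDNN configuration from the [AlexeyAB/darknet](https://github.com/AlexeyAB/darknet) repo on a single V100 with TensorRT 5.1.2.2.
+- The model download and config file links in the YOLOv4(AlexeyAB) rows point to the YOLOv4 model reproduced in PaddleDetection. Its evaluation accuracy is already aligned and fine-tuning is supported; aligning the training accuracy is in progress. See [PaddleDetection YOLOv4 model](../yolov4/README.md).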
+### PP-YOLO tiny
+|    Model     | GPU number | images/GPU |  backbone  | input shape | Box AP50<sup>val</sup> | V100 FP32(FPS) | V100 TensorRT FP16(FPS) | download | config |
+|:------------:|:----------:|:----------:|:----------:|:-----------:|:----------------------:|:--------------:|:-----------------------:|:--------:|:------:|
+| PP-YOLO tiny |     4      |     32     | ResNet18vd |     416     |          47.0          |     401.6      |          724.6          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_tiny.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_tiny.yml) |
+| PP-YOLO tiny |     4      |     32     | ResNet18vd |     320     |          43.7          |     478.5      |          791.3          | [model](https://paddlemodels.bj.bcebos.com/object_detection/ppyolo_tiny.pdparams) | [config](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/ppyolo/ppyolo_tiny.yml) |
+- PP-YOLO tiny is trained on the COCO train2017 dataset and evaluated on the val2017 dataset. Box AP50<sup>val</sup> is the `mAP(IoU=0.5)` evaluation result.
+- PP-YOLO tiny is trained on 4 GPUs with a batch size of 32 per GPU. If the number of GPUs or the batch size differs from this configuration, adjust the learning rate and the number of iterations according to the [FAQ](../../docs/FAQ.md).
+- The inference speed test environment and test method for PP-YOLO tiny are the same as for PP-YOLO.
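+## Getting Started
+### 1. Training
+Start training on 8 GPUs with the following command (all commands below are assumed to run from the PaddleDetection root directory). Use `--eval` to alternate evaluation with training.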
+```bash
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tools/train.py -c configs/ppyolo/ppyolo.yml --eval
+```
+### 2. Evaluation
+Evaluate the model on the COCO val2017 dataset on a single GPU with the following commands:
+```bash
+# use weights released by PaddleDetection
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+# use a checkpoint saved during training
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo.yml -o weights=output/ppyolo/best_model
+```
+`configs/ppyolo/ppyolo_test.yml` is provided for evaluation on the COCO test-dev2017 dataset. First download the test-dev2017 dataset from the [COCO dataset download page](https://cocodataset.org/#download), decompress it to the paths configured by `EvalReader.dataset` in `configs/ppyolo/ppyolo_test.yml`, and then run the evaluation with the following commands:
+```bash
+# use weights released by PaddleDetection
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo_test.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+# use a checkpoint saved during training
+CUDA_VISIBLE_DEVICES=0 python tools/eval.py -c configs/ppyolo/ppyolo_test.yml -o weights=output/ppyolo/best_model
+```
+Evaluation results are saved in `bbox.json`. Compress it into a zip package and submit it through the [COCO dataset evaluation page](https://competitions.codalab.org/competitions/20794#participate).
+**NOTE:** `configs/ppyolo/ppyolo_test.yml` is only for evaluation on the COCO test-dev dataset; it is not for training or for evaluation on COCO val2017.
+### 3. Inference
+Run inference on a single GPU with the following commands. Use `--infer_img` to specify an image path, or `--infer_dir` to specify a directory and run inference on all images in it.
+```bash
+# run inference on a single image
+CUDA_VISIBLE_DEVICES=0 python tools/infer.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --infer_img=demo/000000014439_640x640.jpg
+# run inference on all images in a directory
+CUDA_VISIBLE_DEVICES=0 python tools/infer.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --infer_dir=demo
+```
+### 4. Inference deployment and benchmark
+For PP-YOLO deployment and inference benchmarking, export the model with `tools/export_model.py` and then deploy and run inference with the Paddle inference library, using the following commands:
+```bash
+# export the model; it is saved in output/ppyolo by default
+python tools/export_model.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams
+# run inference with the Paddle inference library
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True
+```
+The PP-YOLO benchmark covers only the network itself, excluding data preprocessing and post-processing of the network output (NMS). When exporting the model, specify `--exclude_nms` to prune the NMS post-processing from the model. Export the model and run the benchmark with the following commands:
+```bash
+# export the model; --exclude_nms prunes the NMS part; the model is saved in output/ppyolo by default
+python tools/export_model.py -c configs/ppyolo/ppyolo.yml -o weights=https://paddlemodels.bj.bcebos.com/object_detection/ppyolo.pdparams --exclude_nms
+# FP32 benchmark
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True --run_benchmark=True
+# TensorRT FP16 benchmark
+CUDA_VISIBLE_DEVICES=0 python deploy/python/infer.py --model_dir=output/ppyolo --image_file=demo/000000014439_640x640.jpg --use_gpu=True --run_benchmark=True --run_mode=trt_fp16
+```
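+## Future Work
+1. Release PP-YOLO-tiny models
+2. Release PP-YOLO and PP-YOLO-tiny models with more backbones
+## Appendix
+The ablation results for the optimizations of PP-YOLO relative to YOLOv3 are shown in the table below.
+| NO.  |        Model                 | Box AP<sup>val</sup> | Box AP<sup>test</sup> | Params(M) | FLOPs(G) | V100 FP32 FPS |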
+| :--: | :--------------------------- | :------------------: | :-------------------: | :-------: | :------: | :-----------: |
+|  A   | YOLOv3-DarkNet53             |         38.9         |            -          |   59.13   |  65.52   |      58.2     |
+|  B   | YOLOv3-ResNet50vd-DCN        |         39.1         |            -          |   43.89   |  44.71   |      79.2     |
+|  C   | B + LB + EMA + DropBlock     |         41.4         |            -          |   43.89   |  44.71   |      79.2     |
+|  D   | C + IoU Loss                 |         41.9         |            -          |   43.89   |  44.71   |      79.2     |
+|  E   | D + IoU Aware                |         42.5         |            -          |   43.90   |  44.71   |      74.9     |
+|  F   | E + Grid Sensitive           |         42.8         |            -          |   43.90   |  44.71   |      74.8     |
+|  G   | F + Matrix NMS               |         43.5         |            -          |   43.90   |  44.71   |      74.8     |
+|  H   | G + CoordConv                |         44.0         |            -          |   43.93   |  44.76   |      74.1     |
+|  I   | H + SPP                      |         44.3         |          45.2         |   44.93   |  45.12   |      72.9     |
+|  J   | I + Better ImageNet Pretrain |         44.6         |          45.2         |   44.93   |  45.12   |      72.9     |
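+**Notes:**
+- Accuracy and inference speed are both measured with an input image size of 608.
+- Box AP is the `mAP(IoU=0.5:0.95)` result for models trained on COCO train2017 and evaluated on val2017 and test-dev2017.
+- Inference speed is measured on a single V100 with batch size=1 using the benchmark method described above; the test environment is CUDA 10.2 and CUDNN 7.5.1.
+- [YOLOv3-DarkNet53](../yolov3_darknet.yml) with an accuracy of 38.9 is the YOLOv3 model optimized in PaddleDetection; see the [Model Zoo](../../docs/MODEL_ZOO_cn.md).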
--- a/configs/ppyolo/ppyolo.yml
+++ b/configs/ppyolo/ppyolo.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 250000
+log_smooth_window: 100
+log_iter: 100
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
+weights: output/ppyolo/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 50
+  feature_maps: [3, 4, 5]
+  variant: d
+  dcn_v2_stages: [5]
+YOLOv3Head:
+  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 13], [16, 30], [33, 23],
+            [30, 61], [62, 45], [59, 119],
+            [116, 90], [156, 198], [373, 326]]
+  norm_decay: 0.
+  coord_conv: true
+  iou_aware: true
+  iou_aware_factor: 0.4
+  scale_x_y: 1.05
+  spp: true
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+YOLOv3Loss:
+  batch_size: 24
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+  iou_aware_loss: IouAwareLoss
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+IouAwareLoss:
+  loss_weight: 1.0
+  max_height: 608
+  max_width: 608
+MatrixNMS:
+  background_label: -1
+  keep_top_k: 100
+  normalized: false
+  score_threshold: 0.01
+  post_threshold: 0.01
+LearningRate:
+  base_lr: 0.01
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 150000
+    - 200000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+_READER_: 'ppyolo_reader.yml'
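The `LearningRate` section above combines a piecewise decay with a linear warmup. The sketch below shows one plausible reading of how the two schedulers interact. It is an assumption made for illustration, not a copy of the PaddleDetection implementation: during the first `steps` iterations the rate ramps linearly from `start_factor * base_lr` to `base_lr`, and afterwards it is multiplied by `gamma` once for every milestone already reached. The function name `learning_rate` is hypothetical; the default values are taken from the config.

```python
def learning_rate(it, base_lr=0.01, gamma=0.1, milestones=(150000, 200000),
                  warmup_steps=4000, start_factor=0.0):
    """Learning rate at iteration `it` under warmup + piecewise decay."""
    if it < warmup_steps:
        # Linear interpolation of the factor from start_factor to 1.
        alpha = it / warmup_steps
        factor = start_factor * (1.0 - alpha) + alpha
        return base_lr * factor
    # After warmup: one multiplication by gamma per milestone reached.
    passed = sum(1 for m in milestones if it >= m)
    return base_lr * gamma ** passed


for it in (0, 2000, 4000, 150000, 200000):
    print(it, learning_rate(it))
```

Passing `milestones=(400000, 450000)` gives the corresponding curve for the longer `ppyolo_2x.yml` schedule.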
--- a/configs/ppyolo/ppyolo_2x.yml
+++ b/configs/ppyolo/ppyolo_2x.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 500000
+log_smooth_window: 100
+log_iter: 100
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
+weights: output/ppyolo/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 50
+  feature_maps: [3, 4, 5]
+  variant: d
+  dcn_v2_stages: [5]
+YOLOv3Head:
+  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 13], [16, 30], [33, 23],
+            [30, 61], [62, 45], [59, 119],
+            [116, 90], [156, 198], [373, 326]]
+  norm_decay: 0.
+  coord_conv: true
+  iou_aware: true
+  iou_aware_factor: 0.4
+  scale_x_y: 1.05
+  spp: true
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+YOLOv3Loss:
+  batch_size: 24
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+  iou_aware_loss: IouAwareLoss
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+IouAwareLoss:
+  loss_weight: 1.0
+  max_height: 608
+  max_width: 608
+MatrixNMS:
+  background_label: -1
+  keep_top_k: 100
+  normalized: false
+  score_threshold: 0.01
+  post_threshold: 0.01
+LearningRate:
+  base_lr: 0.01
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 400000
+    - 450000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+_READER_: 'ppyolo_reader.yml'
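The `scale_x_y: 1.05` setting in the configs above corresponds to the Grid Sensitive trick. A minimal sketch of the box-center decoding it implies is shown below, assuming the form `scale * sigmoid(t) - 0.5 * (scale - 1)` for the in-cell offset. The function `decode_center` and its argument names are hypothetical and serve only to illustrate the formula.

```python
import math


def decode_center(t, grid_index, stride, scale_x_y=1.05):
    """Decode one box-center coordinate (in input pixels).

    t          -- raw network output for the coordinate
    grid_index -- index of the grid cell along this axis
    stride     -- downsample ratio of the feature map (e.g. 32, 16, 8)
    """
    sig = 1.0 / (1.0 + math.exp(-t))
    # Stretch the sigmoid range slightly beyond (0, 1) so that centers
    # lying exactly on a cell border become reachable.
    offset = scale_x_y * sig - 0.5 * (scale_x_y - 1.0)
    return (grid_index + offset) * stride


print(decode_center(0.0, 3, 32))   # center of cell 3 on a stride-32 map
print(decode_center(50.0, 0, 1))   # saturated output lands just past the cell border
```

With `scale_x_y=1.0` the expression reduces to the plain YOLOv3 decoding, where the offset never reaches the cell border for any finite output.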
--- a/configs/ppyolo/ppyolo_reader.yml
+++ b/configs/ppyolo/ppyolo_reader.yml
+TrainReader:
+  inputs_def:
+    fields: ['image', 'gt_bbox', 'gt_class', 'gt_score']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: train2017
+      anno_path: annotations/instances_train2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+      with_mixup: True
+    - !MixupImage
+      alpha: 1.5
+      beta: 1.5
+    - !ColorDistort {}
+    - !RandomExpand
+      fill_value: [123.675, 116.28, 103.53]
+    - !RandomCrop {}
+    - !RandomFlipImage
+      is_normalized: false
+    - !NormalizeBox {}
+    - !PadBox
+      num_max_boxes: 50
+    - !BboxXYXY2XYWH {}
+  batch_transforms:
+  - !RandomShape
+    sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
+    random_inter: True
+  - !NormalizeImage
+    mean: [0.485, 0.456, 0.406]
+    std: [0.229, 0.224, 0.225]
+    is_scale: True
+    is_channel_first: false
+  - !Permute
+    to_bgr: false
+    channel_first: True
+  # Gt2YoloTarget is only used when use_fine_grained_loss set as true,
+  # this operator will be deleted automatically if use_fine_grained_loss
+  # is set as false
+  - !Gt2YoloTarget
+    anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+    anchors: [[10, 13], [16, 30], [33, 23],
+              [30, 61], [62, 45], [59, 119],
+              [116, 90], [156, 198], [373, 326]]
+    downsample_ratios: [32, 16, 8]
+  batch_size: 24
+  shuffle: true
+  mixup_epoch: 25000
+  drop_last: true
+  worker_num: 8
+  bufsize: 4
+  use_process: true
+EvalReader:
+  inputs_def:
+    fields: ['image', 'im_size', 'im_id']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: val2017
+      anno_path: annotations/instances_val2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 2
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !PadBox
+      num_max_boxes: 50
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 8
+  drop_empty: false
+  worker_num: 8
+  bufsize: 4
+TestReader:
+  inputs_def:
+    image_shape: [3, 608, 608]
+    fields: ['image', 'im_size', 'im_id']
+  dataset:
+    !ImageFolder
+      anno_path: annotations/instances_val2017.json
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 2
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 1
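
The reader config above normalizes images with `mean`, `std`, and `is_scale: True`, and `RandomExpand` pads with `fill_value: [123.675, 116.28, 103.53]`. A minimal sketch of the per-pixel arithmetic, assuming `is_scale` means "divide by 255 before subtracting the mean" (the helper `normalize_pixel` is hypothetical and not part of PaddleDetection):

```python
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=MEAN, std=STD, is_scale=True):
    """Normalize one RGB pixel the way the config describes (assumed semantics)."""
    out = []
    for value, m, s in zip(rgb, mean, std):
        # is_scale: map 0..255 pixel values into 0..1 first
        v = value / 255.0 if is_scale else float(value)
        out.append((v - m) / s)
    return out


# The RandomExpand fill value is the mean expressed in 0..255 units,
# so padded regions end up close to zero after normalization.
padded = normalize_pixel([123.675, 116.28, 103.53])
```

Under that assumption, the fill value is simply `mean * 255`, which is why the expanded border carries no signal after normalization.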
--- a/configs/ppyolo/ppyolo_test.yml
+++ b/configs/ppyolo/ppyolo_test.yml
+# NOTE: this config file is only used for evaluation on COCO test2019 set,
+#       for training or evaluating on COCO val2017, please use ppyolo.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 500000
+log_smooth_window: 100
+log_iter: 100
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar
+weights: output/ppyolo/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+save_prediction_only: True
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 50
+  feature_maps: [3, 4, 5]
+  variant: d
+  dcn_v2_stages: [5]
+YOLOv3Head:
+  anchor_masks: [[6, 7, 8], [3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 13], [16, 30], [33, 23],
+            [30, 61], [62, 45], [59, 119],
+            [116, 90], [156, 198], [373, 326]]
+  norm_decay: 0.
+  coord_conv: true
+  iou_aware: true
+  iou_aware_factor: 0.4
+  scale_x_y: 1.05
+  spp: true
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+YOLOv3Loss:
+  batch_size: 24
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+  iou_aware_loss: IouAwareLoss
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+IouAwareLoss:
+  loss_weight: 1.0
+  max_height: 608
+  max_width: 608
+MatrixNMS:
+    background_label: -1
+    keep_top_k: 100
+    normalized: false
+    score_threshold: 0.01
+    post_threshold: 0.01
+LearningRate:
+  base_lr: 0.00333
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 400000
+    - 450000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+_READER_: 'ppyolo_reader.yml'
+EvalReader:
+  inputs_def:
+    fields: ['image', 'im_size', 'im_id']
+    num_max_boxes: 90
+  dataset:
+    !COCODataSet
+      image_dir: test2017
+      anno_path: annotations/image_info_test-dev2017.json
+      dataset_dir: dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 1
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !Permute
+      to_bgr: false
+      channel_first: True
+  batch_size: 1
+TestReader:
+  dataset:
+    !ImageFolder
+    use_default_label: true
+    with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+    - !ResizeImage
+      target_size: 608
+      interp: 1
+    - !NormalizeImage
+      mean: [0.485, 0.456, 0.406]
+      std: [0.229, 0.224, 0.225]
+      is_scale: True
+      is_channel_first: false
+    - !Permute
+      to_bgr: false
+      channel_first: True
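
The `LearningRate` block in this config combines `PiecewiseDecay` (gamma 0.1 at iterations 400000 and 450000) with `LinearWarmup` (4000 steps from `start_factor: 0.`). A rough sketch of the resulting schedule, assuming the warm-up ramps linearly up to `base_lr` and each milestone multiplies the rate by `gamma` (the function `learning_rate` is illustrative only, not the library's scheduler):

```python
def learning_rate(step,
                  base_lr=0.00333,
                  milestones=(400000, 450000),
                  gamma=0.1,
                  warmup_steps=4000,
                  start_factor=0.0):
    """Approximate learning rate at a given iteration (assumed semantics)."""
    if step < warmup_steps:
        # linear ramp from start_factor * base_lr up to base_lr
        alpha = step / float(warmup_steps)
        return base_lr * (start_factor + (1.0 - start_factor) * alpha)
    # one multiplication by gamma for every milestone already reached
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** passed


schedule = [learning_rate(s) for s in (0, 2000, 4000, 400000, 450000)]
```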
--- a/configs/ppyolo/ppyolo_tiny.yml
+++ b/configs/ppyolo/ppyolo_tiny.yml
+architecture: YOLOv3
+use_gpu: true
+max_iters: 250000
+log_smooth_window: 20
+log_iter: 20
+save_dir: output
+snapshot_iter: 10000
+metric: COCO
+pretrain_weights: https://paddle-imagenet-models-name.bj.bcebos.com/ResNet18_vd_pretrained.tar
+weights: output/ppyolo_tiny/model_final
+num_classes: 80
+use_fine_grained_loss: true
+use_ema: true
+ema_decay: 0.9998
+YOLOv3:
+  backbone: ResNet
+  yolo_head: YOLOv3Head
+  use_fine_grained_loss: true
+ResNet:
+  norm_type: sync_bn
+  freeze_at: 0
+  freeze_norm: false
+  norm_decay: 0.
+  depth: 18
+  feature_maps: [4, 5]
+  variant: d
+YOLOv3Head:
+  anchor_masks: [[3, 4, 5], [0, 1, 2]]
+  anchors: [[10, 14], [23, 27], [37, 58],
+            [81, 82], [135, 169], [344, 319]]
+  norm_decay: 0.
+  conv_block_num: 0
+  scale_x_y: 1.05
+  yolo_loss: YOLOv3Loss
+  nms: MatrixNMS
+  drop_block: true
+YOLOv3Loss:
+  batch_size: 32
+  ignore_thresh: 0.7
+  scale_x_y: 1.05
+  label_smooth: false
+  use_fine_grained_loss: true
+  iou_loss: IouLoss
+IouLoss:
+  loss_weight: 2.5
+  max_height: 608
+  max_width: 608
+MatrixNMS:
+    background_label: -1
+    keep_top_k: 100
+    normalized: false
+    score_threshold: 0.01
+    post_threshold: 0.01
+LearningRate:
+  base_lr: 0.004
+  schedulers:
+  - !PiecewiseDecay
+    gamma: 0.1
+    milestones:
+    - 150000
+    - 200000
+  - !LinearWarmup
+    start_factor: 0.
+    steps: 4000
+OptimizerBuilder:
+  optimizer:
+    momentum: 0.9
+    type: Momentum
+  regularizer:
+    factor: 0.0005
+    type: L2
+_READER_: 'ppyolo_reader.yml'
+TrainReader:
+  inputs_def:
+    fields: ['image', 'gt_bbox', 'gt_class', 'gt_score']
+    num_max_boxes: 50
+  dataset:
+    !COCODataSet
+      image_dir: train2017
+      anno_path: annotations/instances_train2017.json
+      dataset_dir: train_data/dataset/coco
+      with_background: false
+  sample_transforms:
+    - !DecodeImage
+      to_rgb: True
+      with_mixup: True
+    - !MixupImage
+      alpha: 1.5
+      beta: 1.5
+    - !ColorDistort {}
+    - !RandomExpand
+      fill_value: [123.675, 116.28, 103.53]
+    - !RandomCrop {}
+    - !RandomFlipImage
+      is_normalized: false
+    - !NormalizeBox {}
+    - !PadBox
+      num_max_boxes: 50
+    - !BboxXYXY2XYWH {}
+  batch_transforms:
+  - !RandomShape
+    sizes: [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
+    random_inter: True
+  - !NormalizeImage
+    mean: [0.485, 0.456, 0.406]
+    std: [0.229, 0.224, 0.225]
+    is_scale: True
+    is_channel_first: false
+  - !Permute
+    to_bgr: false
+    channel_first: True
+  # Gt2YoloTarget is only used when use_fine_grained_loss is set to true;
+  # this operator is removed automatically if use_fine_grained_loss
+  # is set to false
+  - !Gt2YoloTarget
+    anchor_masks: [[3, 4, 5], [0, 1, 2]]
+    anchors: [[10, 14], [23, 27], [37, 58],
+              [81, 82], [135, 169], [344, 319]]
+    downsample_ratios: [32, 16]
+  batch_size: 32
+  shuffle: true
+  mixup_epoch: 500
+  drop_last: true
+  worker_num: 16
+  bufsize: 8
+  use_process: true
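
The tiny config uses two output layers (`anchor_masks` with two entries, `downsample_ratios: [32, 16]`, six anchors), while the full config uses three. A small sanity check that holds for both configs in this PR can be sketched as follows; the helper `check_yolo_head` is hypothetical, and the rule "every anchor index is used exactly once" is an assumption drawn from these two configs rather than a documented library requirement:

```python
def check_yolo_head(anchors, anchor_masks, downsample_ratios):
    """Return the number of output layers if the settings are consistent."""
    used = sorted(i for mask in anchor_masks for i in mask)
    if used != list(range(len(anchors))):
        raise ValueError("anchor_masks must reference each anchor exactly once")
    if len(downsample_ratios) != len(anchor_masks):
        raise ValueError("one downsample ratio is needed per output layer")
    return len(anchor_masks)


tiny_layers = check_yolo_head(
    anchors=[[10, 14], [23, 27], [37, 58], [81, 82], [135, 169], [344, 319]],
    anchor_masks=[[3, 4, 5], [0, 1, 2]],
    downsample_ratios=[32, 16])
```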
--- a/deploy/python/infer.py
+++ b/deploy/python/infer.py
@@ -466,7 +466,12 @@ class Detector():
            results['masks'] = np_masks
        return results
-    def predict(self, image, threshold=0.5, warmup=0, repeats=1):
+    def predict(self,
+                image,
+                threshold=0.5,
+                warmup=0,
+                repeats=1,
+                run_benchmark=False):
        '''
        Args:
            image (str/np.ndarray): path of image/ np.ndarray read by cv2
@@ -500,7 +505,7 @@ class Detector():
                np_masks = np.array(outs[1])
        else:
            input_names = self.predictor.get_input_names()
-            for i in range(len(inputs)):
+            for i in range(len(input_names)):
                input_tensor = self.predictor.get_input_tensor(input_names[i])
                input_tensor.copy_from_cpu(inputs[input_names[i]])
@@ -528,12 +533,15 @@ class Detector():
            ms = (t2 - t1) * 1000.0 / repeats
            print("Inference: {} ms per batch image".format(ms))
-        if reduce(lambda x, y: x * y, np_boxes.shape) < 6:
+        # do not perform postprocess in benchmark mode
-            print('[WARNNING] No object detected.')
+        results = []
-            results = {'boxes': np.array([])}
+        if not run_benchmark:
-        else:
+            if reduce(lambda x, y: x * y, np_boxes.shape) < 6:
-            results = self.postprocess(
+                print('[WARNING] No object detected.')
-                np_boxes, np_masks, im_info, threshold=threshold)
+                results = {'boxes': np.array([])}
+            else:
+                results = self.postprocess(
+                    np_boxes, np_masks, im_info, threshold=threshold)
        return results
@@ -543,7 +551,11 @@ def predict_image():
        FLAGS.model_dir, use_gpu=FLAGS.use_gpu, run_mode=FLAGS.run_mode)
    if FLAGS.run_benchmark:
        detector.predict(
-            FLAGS.image_file, FLAGS.threshold, warmup=100, repeats=100)
+            FLAGS.image_file,
+            FLAGS.threshold,
+            warmup=100,
+            repeats=100,
+            run_benchmark=True)
    else:
        results = detector.predict(FLAGS.image_file, FLAGS.threshold)
        visualize(

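The `run_benchmark` path above times only the forward passes and skips postprocessing. The warm-up/repeat pattern it relies on can be sketched independently of the predictor; `time_inference` and `run_once` are hypothetical names standing in for the timed predictor call:

```python
import time


def time_inference(run_once, warmup=0, repeats=1):
    """Return the mean latency in milliseconds over `repeats` timed calls."""
    for _ in range(warmup):
        run_once()  # untimed warm-up iterations
    t1 = time.perf_counter()
    for _ in range(repeats):
        run_once()
    t2 = time.perf_counter()
    return (t2 - t1) * 1000.0 / repeats


calls = []
ms = time_inference(lambda: calls.append(1), warmup=3, repeats=5)
```

Keeping postprocessing out of the timed loop, as the diff does, means the reported number reflects the model's forward pass only.
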
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -2,6 +2,24 @@
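 ## Latest Release Information
+### v0.4.0(07/2020)
+  - Richer model coverage:
+    - Released the PPYOLO model, which reaches 45.2% accuracy on the COCO dataset and 72.9 FPS inference speed on a single V100, outperforming YOLOv4 in both accuracy and inference speed.
+    - Added the TTFNet model; the base version matches competing implementations, reaching 32.9% accuracy on the COCO dataset.
+    - Added the HTC model; the base version matches competing implementations, reaching 42.2% accuracy on the COCO dataset.
+    - Added the BlazeFace face keypoint detection model, reaching 85.2% accuracy on the Easy set of the Wider-Face dataset.
+    - Added the ACFPN model, reaching 39.6% accuracy on the COCO dataset.
+    - Released a general-purpose server-side object detection model (covering 676 classes); with the same strategy on the COCO dataset, COCO mAP reaches 49.4% at 19.5 FPS on a V100.
+  - Mobile model optimization:
+    - Added the optimized SSDLite model series, including a new GhostNet backbone and a new FPN component, improving accuracy by 0.5%-1.5%.
+  - Usability improvements and functional components:
+    - Added the GridMask and RandomErasing data augmentation methods.
+    - Added Matrix NMS support.
+    - Added EMA (Exponential Moving Average) training support.
+    - Added a multi-machine training method; two machines achieve an average speedup ratio of 80% relative to a single machine. Multi-machine training support awaits further verification.
 ### v0.3.0(05/2020)
  - Richer model coverage:
    - Added the Efficientdet-D0 model, with speed and accuracy better than competing implementations.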

--- a/docs/MODEL_ZOO.md
+++ b/docs/MODEL_ZOO.md
@@ -226,12 +226,12 @@ randomly cropping, randomly expansion, randomly flipping.
 ### Face Detection
-Please refer [face detection models](https://github.com/PaddlePaddle/PaddleDetection/blob/master/configs/face_detection) for details.
+Please refer to [face detection models](featured_model/FACE_DETECTION_en.md) for details.
 ### Object Detection in Open Images Dataset V5
-Please refer [Open Images Dataset V5 Baseline model](featured_model/OIDV5_BASELINE_MODEL.md) for details.
+Please refer to [Open Images Dataset V5 Baseline model](featured_model/champion_model/OIDV5_BASELINE_MODEL.md) for details.
 ### Anchor Free Models

--- a/docs/MODEL_ZOO_cn.md
+++ b/docs/MODEL_ZOO_cn.md
@@ -221,7 +221,7 @@ Paddle provides backbone pretrained models based on ImageNet. All pretrained models
 ### Object Detection on the Open Images V5 Dataset
-For details, see [Open Images V5 dataset baseline model](featured_model/OIDV5_BASELINE_MODEL.md).
+For details, see [Open Images V5 dataset baseline model](featured_model/champion_model/OIDV5_BASELINE_MODEL.md).
 ### Anchor Free Models

--- a/docs/images/ppyolo_map_fps.png
+++ b/docs/images/ppyolo_map_fps.png
--- a/ppdet/modeling/anchor_heads/yolo_head.py
+++ b/ppdet/modeling/anchor_heads/yolo_head.py
--- a/ppdet/modeling/architectures/blazeface.py
+++ b/ppdet/modeling/architectures/blazeface.py
@@ -251,7 +251,9 @@ class BlazeFace(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'eval')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
    def is_bbox_normalized(self):

--- a/ppdet/modeling/architectures/cascade_mask_rcnn.py
+++ b/ppdet/modeling/architectures/cascade_mask_rcnn.py
@@ -434,5 +434,7 @@ class CascadeMaskRCNN(object):
            return self.build_multi_scale(feed_vars, mask_branch)
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/cascade_rcnn.py
+++ b/ppdet/modeling/architectures/cascade_rcnn.py
@@ -331,5 +331,7 @@ class CascadeRCNN(object):
            return self.build_multi_scale(feed_vars)
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/cascade_rcnn_cls_aware.py
+++ b/ppdet/modeling/architectures/cascade_rcnn_cls_aware.py
@@ -319,5 +319,7 @@ class CascadeRCNNClsAware(object):
            return self.build_multi_scale(feed_vars)
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/cornernet_squeeze.py
+++ b/ppdet/modeling/architectures/cornernet_squeeze.py
@@ -136,5 +136,7 @@ class CornerNetSqueeze(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, mode='test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, mode='test')
--- a/ppdet/modeling/architectures/efficientdet.py
+++ b/ppdet/modeling/architectures/efficientdet.py
@@ -146,5 +146,7 @@ class EfficientDet(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/faceboxes.py
+++ b/ppdet/modeling/architectures/faceboxes.py
@@ -183,7 +183,9 @@ class FaceBoxes(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'eval')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
    def is_bbox_normalized(self):

--- a/ppdet/modeling/architectures/faster_rcnn.py
+++ b/ppdet/modeling/architectures/faster_rcnn.py
@@ -244,5 +244,7 @@ class FasterRCNN(object):
            return self.build_multi_scale(feed_vars)
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/fcos.py
+++ b/ppdet/modeling/architectures/fcos.py
@@ -179,5 +179,7 @@ class FCOS(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/mask_rcnn.py
+++ b/ppdet/modeling/architectures/mask_rcnn.py
@@ -337,5 +337,7 @@ class MaskRCNN(object):
            return self.build_multi_scale(feed_vars, mask_branch)
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/retinanet.py
+++ b/ppdet/modeling/architectures/retinanet.py
@@ -125,5 +125,7 @@ class RetinaNet(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
--- a/ppdet/modeling/architectures/ssd.py
+++ b/ppdet/modeling/architectures/ssd.py
@@ -134,7 +134,9 @@ class SSD(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, 'eval')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
+        assert not exclude_nms, "exclude_nms for {} is not supported currently".format(
+            self.__class__.__name__)
        return self.build(feed_vars, 'test')
    def is_bbox_normalized(self):

--- a/ppdet/modeling/architectures/yolo.py
+++ b/ppdet/modeling/architectures/yolo.py
@@ -49,7 +49,7 @@ class YOLOv3(object):
        self.yolo_head = yolo_head
        self.use_fine_grained_loss = use_fine_grained_loss
-    def build(self, feed_vars, mode='train'):
+    def build(self, feed_vars, mode='train', exclude_nms=False):
        im = feed_vars['image']
        mixed_precision_enabled = mixed_precision_global_state() is not None
@@ -74,9 +74,9 @@ class YOLOv3(object):
            gt_score = feed_vars['gt_score']
            # Get targets for split yolo loss calculation
-            # YOLOv3 supports up to 3 output layers currently
+            num_output_layer = len(self.yolo_head.anchor_masks)
            targets = []
-            for i in range(3):
+            for i in range(num_output_layer):
                k = 'target{}'.format(i)
                if k in feed_vars:
                    targets.append(feed_vars[k])
@@ -88,7 +88,9 @@ class YOLOv3(object):
            return loss
        else:
            im_size = feed_vars['im_size']
-            return self.yolo_head.get_prediction(body_feats, im_size)
+            # exclude_nms is only for benchmarking, where postprocessing (NMS) is not needed
+            return self.yolo_head.get_prediction(
+                body_feats, im_size, exclude_nms=exclude_nms)
    def _inputs_def(self, image_shape, num_max_boxes):
        im_shape = [None] + image_shape
@@ -106,11 +108,10 @@ class YOLOv3(object):
        if self.use_fine_grained_loss:
            # yapf: disable
-            targets_def = {
+            num_output_layer = len(self.yolo_head.anchor_masks)
-                'target0':  {'shape': [None, 3, 86, 19, 19],  'dtype': 'float32',   'lod_level': 0},
+            targets_def = {}
-                'target1':  {'shape': [None, 3, 86, 38, 38],  'dtype': 'float32',   'lod_level': 0},
+            for i in range(num_output_layer):
-                'target2':  {'shape': [None, 3, 86, 76, 76],  'dtype': 'float32',   'lod_level': 0},
+                targets_def['target{}'.format(i)] = {'shape': [None, 3, None, None, None],  'dtype': 'float32',   'lod_level': 0}
-            }
            # yapf: enable
            downsample = 32
@@ -139,7 +140,9 @@ class YOLOv3(object):
        # will be disabled for YOLOv3 architecture do not calculate loss in
        # eval/infer mode.
        if 'im_size' not in fields and self.use_fine_grained_loss:
-            fields.extend(['target0', 'target1', 'target2'])
+            num_output_layer = len(self.yolo_head.anchor_masks)
+            fields.extend(
+                ['target{}'.format(i) for i in range(num_output_layer)])
        feed_vars = OrderedDict([(key, fluid.data(
            name=key,
            shape=inputs_def[key]['shape'],
@@ -158,8 +161,8 @@ class YOLOv3(object):
    def eval(self, feed_vars):
        return self.build(feed_vars, mode='test')
-    def test(self, feed_vars):
+    def test(self, feed_vars, exclude_nms=False):
-        return self.build(feed_vars, mode='test')
+        return self.build(feed_vars, mode='test', exclude_nms=exclude_nms)
 @register

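The change above replaces the three hard-coded `target0`..`target2` definitions with entries derived from `len(self.yolo_head.anchor_masks)`, so a two-layer head such as the tiny config gets two targets. A standalone sketch of that construction (the function `build_targets_def` is illustrative; the shape entry mirrors the added line in the diff):

```python
def build_targets_def(anchor_masks):
    """Build one fine-grained-loss target definition per output layer."""
    targets_def = {}
    for i in range(len(anchor_masks)):
        targets_def['target{}'.format(i)] = {
            # spatial and channel dimensions are left dynamic, as in the diff
            'shape': [None, 3, None, None, None],
            'dtype': 'float32',
            'lod_level': 0,
        }
    return targets_def


tiny_targets = build_targets_def([[3, 4, 5], [0, 1, 2]])
```

Note that the second dimension stays fixed at 3, which matches the three anchors per mask used by both configs in this PR.
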
--- a/ppdet/modeling/backbones/mobilenet_v3.py
+++ b/ppdet/modeling/backbones/mobilenet_v3.py
@@ -68,6 +68,9 @@ class MobileNetV3(object):
        if isinstance(feature_maps, Integral):
            feature_maps = [feature_maps]
+        if norm_type == 'sync_bn' and freeze_norm:
+            raise ValueError(
+                "The norm_type should not be sync_bn when freeze_norm is True")
        self.scale = scale
        self.model_name = model_name
        self.feature_maps = feature_maps
@@ -437,16 +440,15 @@ class MobileNetV3(object):
 @register
 class MobileNetV3RCNN(MobileNetV3):
-    def __init__(
+    def __init__(self,
-            self,
+                 scale=1.0,
-            scale=1.0,
+                 model_name='large',
-            model_name='large',
+                 conv_decay=0.0,
-            conv_decay=0.0,
+                 norm_type='bn',
-            norm_type='bn',
+                 norm_decay=0.0,
-            norm_decay=0.0,
+                 freeze_norm=True,
-            freeze_norm=True,
+                 feature_maps=[2, 3, 4, 5],
-            feature_maps=[2, 3, 4, 5],
+                 lr_mult_list=[1.0, 1.0, 1.0, 1.0, 1.0]):
-            lr_mult_list=[1.0, 1.0, 1.0, 1.0, 1.0], ):
        super(MobileNetV3RCNN, self).__init__(
            scale=scale,
            model_name=model_name,
@@ -454,7 +456,8 @@ class MobileNetV3RCNN(MobileNetV3):
            norm_type=norm_type,
            norm_decay=norm_decay,
            lr_mult_list=lr_mult_list,
-            feature_maps=feature_maps)
+            feature_maps=feature_maps,
+            freeze_norm=freeze_norm)
        self.curr_stage = 0
        self.block_stride = 1

--- a/ppdet/modeling/losses/iou_aware_loss.py
+++ b/ppdet/modeling/losses/iou_aware_loss.py
@@ -54,6 +54,7 @@ class IouAwareLoss(IouLoss):
                 anchors,
                 downsample_ratio,
                 batch_size,
+                 scale_x_y,
                 eps=1.e-10):
        '''
        Args:
@@ -67,9 +68,9 @@ class IouAwareLoss(IouLoss):
        '''
        pred = self._bbox_transform(x, y, w, h, anchors, downsample_ratio,
-                                    batch_size, False)
+                                    batch_size, False, scale_x_y, eps)
        gt = self._bbox_transform(tx, ty, tw, th, anchors, downsample_ratio,
-                                  batch_size, True)
+                                  batch_size, True, scale_x_y, eps)
        iouk = self._iou(pred, gt, ioup, eps)
        iouk.stop_gradient = True

--- a/ppdet/modeling/losses/iou_loss.py
+++ b/ppdet/modeling/losses/iou_loss.py
@@ -63,6 +63,7 @@ class IouLoss(object):
                 anchors,
                 downsample_ratio,
                 batch_size,
+                 scale_x_y=1.,
                 ioup=None,
                 eps=1.e-10):
        '''
@@ -75,9 +76,9 @@ class IouLoss(object):
            eps (float): a small constant that prevents the denominator from being zero
        '''
        pred = self._bbox_transform(x, y, w, h, anchors, downsample_ratio,
-                                    batch_size, False)
+                                    batch_size, False, scale_x_y, eps)
        gt = self._bbox_transform(tx, ty, tw, th, anchors, downsample_ratio,
-                                  batch_size, True)
+                                  batch_size, True, scale_x_y, eps)
        iouk = self._iou(pred, gt, ioup, eps)
        if self.loss_square:
            loss_iou = 1. - iouk * iouk
@@ -145,7 +146,7 @@ class IouLoss(object):
        return diou_term + ciou_term
    def _bbox_transform(self, dcx, dcy, dw, dh, anchors, downsample_ratio,
-                        batch_size, is_gt):
+                        batch_size, is_gt, scale_x_y, eps):
        grid_x = int(self._MAX_WI / downsample_ratio)
        grid_y = int(self._MAX_HI / downsample_ratio)
        an_num = len(anchors) // 2
@@ -179,8 +180,11 @@ class IouLoss(object):
            cy.gradient = True
        else:
            dcx_sig = fluid.layers.sigmoid(dcx)
-            cx = fluid.layers.elementwise_add(dcx_sig, gi) / grid_x_act
            dcy_sig = fluid.layers.sigmoid(dcy)
+            if (abs(scale_x_y - 1.0) > eps):
+                dcx_sig = scale_x_y * dcx_sig - 0.5 * (scale_x_y - 1)
+                dcy_sig = scale_x_y * dcy_sig - 0.5 * (scale_x_y - 1)
+            cx = fluid.layers.elementwise_add(dcx_sig, gi) / grid_x_act
            cy = fluid.layers.elementwise_add(dcy_sig, gj) / grid_y_act
        anchor_w_ = [anchors[i] for i in range(0, len(anchors)) if i % 2 == 0]

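The `_bbox_transform` change applies `scale_x_y * sigmoid - 0.5 * (scale_x_y - 1)` to the predicted centre offsets before adding the grid index. A scalar sketch of that decoding, using plain Python in place of `fluid.layers` (the name `decode_center` is hypothetical):

```python
import math


def decode_center(logit, grid_index, grid_size, scale_x_y=1.0, eps=1e-10):
    """Decode a normalized box-centre coordinate from one raw prediction."""
    offset = 1.0 / (1.0 + math.exp(-logit))  # sigmoid, in (0, 1)
    if abs(scale_x_y - 1.0) > eps:
        # stretch the offset range symmetrically around 0.5
        offset = scale_x_y * offset - 0.5 * (scale_x_y - 1.0)
    return (offset + grid_index) / grid_size


center = decode_center(0.0, grid_index=3, grid_size=19, scale_x_y=1.05)
```

With `scale_x_y = 1.05` the offset range widens from (0, 1) to roughly (-0.025, 1.025), so a prediction can reach the cell border without the sigmoid saturating, while a logit of 0 still maps to the cell centre.
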
--- a/ppdet/modeling/losses/yolo_loss.py
+++ b/ppdet/modeling/losses/yolo_loss.py
@@ -91,8 +91,15 @@ class YOLOv3Loss(object):
            return {'loss': sum(losses)}
-    def _get_fine_grained_loss(self, outputs, targets, gt_box, batch_size,
+    def _get_fine_grained_loss(self,
-                               num_classes, mask_anchors, ignore_thresh):
+                               outputs,
+                               targets,
+                               gt_box,
+                               batch_size,
+                               num_classes,
+                               mask_anchors,
+                               ignore_thresh,
+                               eps=1.e-10):
        """
        Calculate fine grained YOLOv3 loss
@@ -136,12 +143,27 @@ class YOLOv3Loss(object):
            tx, ty, tw, th, tscale, tobj, tcls = self._split_target(target)
            tscale_tobj = tscale * tobj
-            loss_x = fluid.layers.sigmoid_cross_entropy_with_logits(
-                x, tx) * tscale_tobj
+            scale_x_y = self.scale_x_y if not isinstance(
-            loss_x = fluid.layers.reduce_sum(loss_x, dim=[1, 2, 3])
+                self.scale_x_y, Sequence) else self.scale_x_y[i]
-            loss_y = fluid.layers.sigmoid_cross_entropy_with_logits(
-                y, ty) * tscale_tobj
+            if (abs(scale_x_y - 1.0) < eps):
-            loss_y = fluid.layers.reduce_sum(loss_y, dim=[1, 2, 3])
+                loss_x = fluid.layers.sigmoid_cross_entropy_with_logits(
+                    x, tx) * tscale_tobj
+                loss_x = fluid.layers.reduce_sum(loss_x, dim=[1, 2, 3])
+                loss_y = fluid.layers.sigmoid_cross_entropy_with_logits(
+                    y, ty) * tscale_tobj
+                loss_y = fluid.layers.reduce_sum(loss_y, dim=[1, 2, 3])
+            else:
+                dx = scale_x_y * fluid.layers.sigmoid(x) - 0.5 * (scale_x_y -
+                                                                  1.0)
+                dy = scale_x_y * fluid.layers.sigmoid(y) - 0.5 * (scale_x_y -
+                                                                  1.0)
+                loss_x = fluid.layers.abs(dx - tx) * tscale_tobj
+                loss_x = fluid.layers.reduce_sum(loss_x, dim=[1, 2, 3])
+                loss_y = fluid.layers.abs(dy - ty) * tscale_tobj
+                loss_y = fluid.layers.reduce_sum(loss_y, dim=[1, 2, 3])
            # NOTE: we refined loss function of (w, h) as L1Loss
            loss_w = fluid.layers.abs(w - tw) * tscale_tobj
            loss_w = fluid.layers.reduce_sum(loss_w, dim=[1, 2, 3])
@@ -149,7 +171,8 @@ class YOLOv3Loss(object):
            loss_h = fluid.layers.reduce_sum(loss_h, dim=[1, 2, 3])
            if self._iou_loss is not None:
                loss_iou = self._iou_loss(x, y, w, h, tx, ty, tw, th, anchors,
-                                          downsample, self._batch_size)
+                                          downsample, self._batch_size,
+                                          scale_x_y)
                loss_iou = loss_iou * tscale_tobj
                loss_iou = fluid.layers.reduce_sum(loss_iou, dim=[1, 2, 3])
                loss_ious.append(fluid.layers.reduce_mean(loss_iou))
@@ -157,14 +180,12 @@ class YOLOv3Loss(object):
            if self._iou_aware_loss is not None:
                loss_iou_aware = self._iou_aware_loss(
                    ioup, x, y, w, h, tx, ty, tw, th, anchors, downsample,
-                    self._batch_size)
+                    self._batch_size, scale_x_y)
                loss_iou_aware = loss_iou_aware * tobj
                loss_iou_aware = fluid.layers.reduce_sum(
                    loss_iou_aware, dim=[1, 2, 3])
                loss_iou_awares.append(fluid.layers.reduce_mean(loss_iou_aware))
-            scale_x_y = self.scale_x_y if not isinstance(
-                self.scale_x_y, Sequence) else self.scale_x_y[i]
            loss_obj_pos, loss_obj_neg = self._calc_obj_loss(
                output, obj, tobj, gt_box, self._batch_size, anchors,
                num_classes, downsample, self._ignore_thresh, scale_x_y)

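In `_get_fine_grained_loss`, the x/y loss now branches on `scale_x_y`: sigmoid cross-entropy on the raw logits when the scale is 1, and an L1 loss on the decoded offset otherwise. A scalar sketch of that branch, leaving out the `tscale_tobj` weighting and the reductions (all function names here are illustrative):

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable sigmoid cross-entropy for one element."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def xy_loss(logit, target, scale_x_y=1.0, eps=1e-10):
    """Per-element x/y loss following the branch in the diff."""
    if abs(scale_x_y - 1.0) < eps:
        return bce_with_logits(logit, target)
    decoded = scale_x_y / (1.0 + math.exp(-logit)) - 0.5 * (scale_x_y - 1.0)
    return abs(decoded - target)


plain = xy_loss(0.0, 0.5)
scaled = xy_loss(0.0, 0.5, scale_x_y=1.05)
```

One plausible reading of the switch is that the scaled offset can leave the [0, 1] interval, where cross-entropy against a probability-like target no longer applies cleanly; the diff itself does not state the motivation.
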
--- a/ppdet/modeling/ops.py
+++ b/ppdet/modeling/ops.py
@@ -526,7 +526,7 @@ class MatrixNMS(object):
                 gaussian_sigma=2.,
                 normalized=False,
                 background_label=0):
-        super(MultiClassNMS, self).__init__()
+        super(MatrixNMS, self).__init__()
        self.score_threshold = score_threshold
        self.post_threshold = post_threshold
        self.nms_top_k = nms_top_k

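The one-line fix above matters because `super(OtherClass, self)` raises `TypeError` when `self` is not an instance of `OtherClass`. A minimal reproduction with stand-in classes (these are not the real `ppdet` classes):

```python
class MultiClassNMS(object):
    def __init__(self):
        super(MultiClassNMS, self).__init__()


class BrokenMatrixNMS(object):
    def __init__(self):
        # wrong class passed to super(): self is not a MultiClassNMS
        super(MultiClassNMS, self).__init__()


class FixedMatrixNMS(object):
    def __init__(self):
        super(FixedMatrixNMS, self).__init__()


def constructs(cls):
    """Return True if cls() succeeds, False if it raises TypeError."""
    try:
        cls()
        return True
    except TypeError:
        return False


broken_ok = constructs(BrokenMatrixNMS)
fixed_ok = constructs(FixedMatrixNMS)
```
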
--- a/ppdet/modeling/tests/test_architectures.py
+++ b/ppdet/modeling/tests/test_architectures.py
@@ -19,6 +19,7 @@ from __future__ import print_function
 import unittest
 import numpy as np
+import paddle
 import paddle.fluid as fluid
 import os
 import sys
@@ -70,6 +71,10 @@ class TestCascadeRCNN(TestFasterRCNN):
        self.cfg_file = 'configs/cascade_rcnn_r50_fpn_1x.yml'
+@unittest.skipIf(
+    paddle.version.major < "2",
+    "Paddle 2.0 is required because YOLOv3 takes scale_x_y as input; "
+    "this unittest is disabled for Paddle major version < 2")
 class TestYolov3(TestFasterRCNN):
    def set_config(self):
        self.cfg_file = 'configs/yolov3_darknet.yml'

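The `skipIf` condition above compares `paddle.version.major` with the string `"2"`. String comparison is lexicographic, which works for single-digit majors but would misorder a two-digit one. A hedged alternative, assuming the major component is a numeric string as the diff's comparison implies (`major_at_least` is a hypothetical helper):

```python
def major_at_least(major, minimum):
    """Compare a version's major component numerically."""
    return int(major) >= minimum


lexicographic_pitfall = "10" < "2"   # string comparison, character by character
numeric_result = major_at_least("10", 2)
```
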
--- a/tools/export_model.py
+++ b/tools/export_model.py
@@ -196,7 +196,8 @@ def main():
            inputs_def = cfg['TestReader']['inputs_def']
            inputs_def['use_dataloader'] = False
            feed_vars, _ = model.build_inputs(**inputs_def)
-            test_fetches = model.test(feed_vars)
+            # postprocessing is not needed in exclude_nms mode, so NMS is excluded
+            test_fetches = model.test(feed_vars, exclude_nms=FLAGS.exclude_nms)
    infer_prog = infer_prog.clone(True)
    check_py_func(infer_prog)
@@ -214,6 +215,11 @@ if __name__ == '__main__':
        type=str,
        default="output",
        help="Directory for storing the output model files.")
+    parser.add_argument(
+        "--exclude_nms",
+        action='store_true',
+        default=False,
+        help="Whether to prune NMS for benchmarking")
    FLAGS = parser.parse_args()
    main()
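
The new `--exclude_nms` option is a standard `argparse` boolean switch: it is `False` unless the flag appears on the command line. A self-contained sketch of just that flag (the `build_parser` wrapper is illustrative; the real script defines more arguments):

```python
import argparse


def build_parser():
    """Parser containing only the flag added in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--exclude_nms",
        action="store_true",
        default=False,
        help="Whether to prune NMS for benchmarking")
    return parser


default_flags = build_parser().parse_args([])
benchmark_flags = build_parser().parse_args(["--exclude_nms"])
```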