[cherry-pick] add doc of benchmark (#2505) (#2513)

* [doc] add doc of benchmark (#2505)
* add benchmark, test=document_fix
* update style of number, test=document_fix
* add difference of dynamic shape and fix shape
* update for ci-travis, test=document_fix

5 changed files
--- a/README.md
+++ b/README.md
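@@ -137,11 +137,12 @@ PaddleDetection implements a variety of mainstream object detection algorithms in a modular way and provides
 ### Advanced Tutorials

 - [Model Compression](configs/slim)
- [Inference Deployment](deploy)
+- [Inference Deployment](deploy/README.md)
     - [Model Export Tutorial](deploy/EXPORT_MODEL.md)
     - [Python Inference Deployment](deploy/python)
     - [C++ Inference Deployment](deploy/cpp)
     - [Server-Side Deployment](deploy/serving)
+    - [Inference Benchmark](deploy/BENCHMARK_INFER.md)


 ## Model Zoo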

--- a/deploy/BENCHMARK_INFER.md
+++ b/deploy/BENCHMARK_INFER.md
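+# Inference Benchmark
+
+## Environment Setup
+- Test environment:
+  - CUDA 10.1
+  - CUDNN 7.6
+  - TensorRT-6.0.1
+  - PaddlePaddle v2.0.1
+  - GPUs: Tesla V100, 1080Ti (on Windows), and TX2
+- Test method:
+  - To make inference speed comparable across models, every model receives the same input image, `demo/000000014439_640x640.jpg`, of size 3x640x640.
+  - Batch Size=1
+  - The first 100 warmup iterations are discarded and the average over the following 100 iterations is reported, in ms/image. The time includes network computation and copying the data to the CPU.
+  - The Fluid C++ inference engine is used, covering both Fluid C++ inference and Fluid-TensorRT inference. Both Float32 (FP32) and Float16 (FP16) inference speeds are measured below.
+
+**Note:** For the difference between fixed shape and dynamic shape in TensorRT, see the [TensorRT tutorial](TENSOR_RT.md). Because fixed shape does not fully support two-stage models, the Faster RCNN models are tested with dynamic shape. The set of OPs that can be fused is not exactly the same under fixed shape and dynamic shape, so the same model may perform slightly differently in the two modes.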
+
+## Inference Speed
+
+### V100
+
+| Model                                      | Fixed shape | Input size | paddle\_inference (ms) | trt\_fp32 (ms) | trt\_fp16 (ms) |
+| ------------------------------------------ | ----------- | ---------- | ---------------------- | -------------- | -------------- |
+| ppyolo\_r50vd\_dcn\_1x\_coco               | Yes         | 608x608    | 20.77                  | 18.40443       | 13.532618      |
+| yolov3\_mobilenet\_v1\_270e\_coco          | Yes         | 608x608    | 9.74                   | 8.607459       | 6.275342       |
+| ssd\_mobilenet\_v1\_300\_120e\_voc         | Yes         | 300x300    | 5.17                   | 4.428614       | 4.292153       |
+| ttfnet\_darknet53\_1x\_coco                | Yes         | 512x512    | 10.14                  | 8.708397       | 5.551765       |
+| fcos\_dcn\_r50\_fpn\_1x\_coco              | Yes         | 640x640    | 35.47                  | 35.023315      | 34.24144       |
+| faster\_rcnn\_r50\_fpn\_1x\_coco           | No          | 640x640    | 27.99                  | 26.151001      | 21.922865      |
+| yolov3\_darknet53\_270e\_coco              | Yes         | 608x608    | 17.84                  | 15.431566      | 9.861447       |
+| faster\_rcnn\_r50\_fpn\_1x\_coco(800x1312) | No          | 800x1312   | 32.49                  | 25.536572      | 21.696611      |
+
+
+### Windows 1080Ti
+| Model                                      | Fixed shape | Input size | paddle\_inference (ms) | trt\_fp32 (ms) | trt\_fp16 (ms) |
+| ------------------------------------------ | ----------- | ---------- | ---------------------- | -------------- | -------------- |
+| ppyolo\_r50vd\_dcn\_1x\_coco               | Yes         | 608x608    | 38.06                  | 31.401291      | 31.939096      |
+| yolov3\_mobilenet\_v1\_270e\_coco          | Yes         | 608x608    | 14.51                  | 11.22542       | 11.125602      |
+| ssd\_mobilenet\_v1\_300\_120e\_voc         | Yes         | 300x300    | 16.47                  | 13.874813      | 13.761724      |
+| ttfnet\_darknet53\_1x\_coco                | Yes         | 512x512    | 21.83                  | 17.144808      | 17.092379      |
+| fcos\_dcn\_r50\_fpn\_1x\_coco              | Yes         | 640x640    | 71.88                  | 69.930206      | 69.523048      |
+| faster\_rcnn\_r50\_fpn\_1x\_coco           | No          | 640x640    | 50.74                  | 57.172909      | 62.081978      |
+| yolov3\_darknet53\_270e\_coco              | Yes         | 608x608    | 30.26                  | 23.915573      | 24.019217      |
+| faster\_rcnn\_r50\_fpn\_1x\_coco(800x1312) | No          | 800x1312   | 50.31                  | 57.613659      | 62.050724      |
+
+### NVIDIA Jetson (TX2)
+| Model                                      | Fixed shape | Input size | paddle\_inference (ms) | trt\_fp32 (ms) | trt\_fp16 (ms) |
+| ------------------------------------------ | ----------- | ---------- | ---------------------- | -------------- | -------------- |
+| ppyolo\_r50vd\_dcn\_1x\_coco               | Yes         | 608x608    | 111.80                 | 99.40332       | 48.047401      |
+| yolov3\_mobilenet\_v1\_270e\_coco          | Yes         | 608x608    | 48.76                  | 43.832623      | 18.410919      |
+| ssd\_mobilenet\_v1\_300\_120e\_voc         | Yes         | 300x300    | 10.52                  | 8.840097       | 8.765652       |
+| ttfnet\_darknet53\_1x\_coco                | Yes         | 512x512    | 73.77                  | 64.025124      | 31.464737      |
+| fcos\_dcn\_r50\_fpn\_1x\_coco              | Yes         | 640x640    | 217.11                 | 214.381866     | 205.783844     |
+| faster\_rcnn\_r50\_fpn\_1x\_coco           | No          | 640x640    | 169.45                 | 158.919266     | 119.253937     |
+| yolov3\_darknet53\_270e\_coco              | Yes         | 608x608    | 121.61                 | 110.29866      | 42.379051      |
+| faster\_rcnn\_r50\_fpn\_1x\_coco(800x1312) | No          | 800x1312   | 228.07                 | 156.393372     | 117.026932     |
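
Reviewer note on the test method in `deploy/BENCHMARK_INFER.md`: the warmup-then-average procedure can be sketched in a few lines of Python. This is only an illustration of the measurement scheme, not the harness used to produce the tables. `benchmark` and `run_once` are hypothetical names; `run_once` stands in for one batch-size-1 inference that also copies the outputs back to the CPU, so that the call returns only after the work has finished.

```python
import statistics
import time


def benchmark(run_once, warmup=100, repeats=100):
    """Return the mean latency of run_once() in milliseconds.

    run_once is a hypothetical zero-argument callable (an assumption of
    this sketch) that performs one inference and returns only after the
    results are available on the CPU.
    """
    # Warmup iterations are executed but never timed.
    for _ in range(warmup):
        run_once()

    # Time each of the following iterations individually.
    latencies_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # With batch size 1, the mean per-call latency is the ms/image figure.
    return statistics.mean(latencies_ms)


# Stand-in workload; a real run would call the predictor here instead.
mean_ms = benchmark(lambda: sum(range(10_000)), warmup=10, repeats=10)
print(f"{mean_ms:.3f} ms/image")
```

Timing each iteration separately, rather than timing the whole loop once, also makes it easy to report a median or percentile if the mean turns out to be noisy.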
--- a/deploy/cpp/docs/Jetson_build.md
+++ b/deploy/cpp/docs/Jetson_build.md
@@ -183,17 +183,4 @@ CUDNN_LIB=/usr/lib/aarch64-linux-gnu/


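 ## Performance Testing
-Test environment: hardware: TX2, JetPack version: 4.3, Paddle inference library: 1.8.4, CUDA: 10.0, CUDNN: 7.5, TensorRT: 5.0.
-
-The first 100 warmup iterations are discarded and the average over 100 iterations is reported, in ms/image. Only model execution time is counted; data processing and copying are excluded.
-
-
-|Model | Input | AnalysisPredictor(ms) |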
-|---|----|---|
-| yolov3_mobilenet_v1 |  608*608  | 56.243858
-| faster_rcnn_r50_1x  | 1333*1333  | 73.552460
-| faster_rcnn_r50_vd_fpn_2x | 1344*1344 | 87.582146
-| mask_rcnn_r50_fpn_1x | 1344*1344  | 107.317848
-| mask_rcnn_r50_vd_fpn_2x | 1344*1344  | 87.98.708122
-| ppyolo_r18vd | 320*320  |  22.876789
-| ppyolo_2x | 608*608  | 68.562050
+For benchmark results, see [BENCHMARK_INFER](../../BENCHMARK_INFER.md)
--- a/deploy/cpp/docs/linux_build.md
+++ b/deploy/cpp/docs/linux_build.md
@@ -124,3 +124,6 @@ make
 ./build/main --model_dir=/root/projects/models/yolov3_darknet --video_path=/root/projects/images/test.mp4 --use_gpu=1
 ```
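 Video prediction currently supports the `.mp4` format; the visualized prediction results are saved to `output.mp4` in the current directory.
+
+## Performance Testing
+For benchmark results, see [BENCHMARK_INFER](../../BENCHMARK_INFER.md)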
--- a/deploy/cpp/docs/windows_vs2019_build.md
+++ b/deploy/cpp/docs/windows_vs2019_build.md
@@ -125,18 +125,4 @@ cd D:\projects\PaddleDetection\deploy\cpp\out\build\x64-Release


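 ## Performance Testing
-Test environment: OS: Windows 10 Pro, CPU: I9-9820X, GPU: GTX 2080 Ti, Paddle inference library: 1.8.4, CUDA: 10.0, CUDNN: 7.4.
-
-The first 100 warmup iterations are discarded and the average over 100 iterations is reported, in ms/image. Only model execution time is counted; data processing and copying are excluded.
-
-
-|Model | AnalysisPredictor(ms) | Input|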
-|---|----|---|
-| YOLOv3-MobileNetv1 | 41.51 |  608*608
-| faster_rcnn_r50_1x | 194.47 | 1333*1333
-| faster_rcnn_r50_vd_fpn_2x | 43.35 | 1344*1344
-| mask_rcnn_r50_fpn_1x | 96.96 | 1344*1344
-| mask_rcnn_r50_vd_fpn_2x | 97.66 | 1344*1344
-| ppyolo_r18vd | 5.54 | 320*320
-| ppyolo_2x | 56.93 | 608*608
-| ttfnet_darknet | 36.17 | 512*512
+For benchmark results, see [BENCHMARK_INFER](../../BENCHMARK_INFER.md)