Commit 6cee8749 (unverified)
Authored Aug 22, 2022 by Guanghua Yu; committed via GitHub on Aug 22, 2022
Parent: 53cc3430

update YOLO series TRT eval demo (#1374)

4 changed files with 656 additions and 16 deletions:
- example/auto_compression/pytorch_yolo_series/README.md (+30, -16)
- example/auto_compression/pytorch_yolo_series/TensorRT/README.md (+41, -0)
- example/auto_compression/pytorch_yolo_series/TensorRT/trt_backend.py (+264, -0)
- example/auto_compression/pytorch_yolo_series/TensorRT/trt_eval.py (+321, -0)
example/auto_compression/pytorch_yolo_series/README.md
@@ -18,23 +18,23 @@
## 2.Benchmark
-| Model | Strategy | Input Size | mAP<sup>val<br>0.5:0.95</sup> | Model Size | Latency<sup><small>FP32</small></sup> | Latency<sup><small>FP16</small></sup> | Latency<sup><small>INT8</small></sup> | Config | Inference Model |
-| :-------- | :-------- | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
-| YOLOv5s | Base model | 640*640 | 37.4 | 28.1MB | 5.95ms | 2.44ms | - | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov5s.onnx) |
-| YOLOv5s | Post-training quantization | 640*640 | 36.0 | 7.4MB | - | - | 1.87ms | [config](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/post_training_quantization/pytorch_yolo_series) | - |
-| YOLOv5s | ACT quantization-aware training | 640*640 | **36.9** | 7.4MB | - | - | **1.87ms** | [config](./configs/yolov5s_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov5s_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov5s_quant.onnx) |
+| Model | Strategy | Input Size | mAP<sup>val<br>0.5:0.95</sup> | Model Size | Latency<sup><small>FP32</small></sup> | Latency<sup><small>FP16</small></sup> | Latency<sup><small>INT8</small></sup> | Host Memory | GPU Memory | Config | Inference Model |
+| :-------- | :-------- | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: | :--------: |
+| YOLOv5s | Base model | 640*640 | 37.4 | 28.1MB | 5.95ms | 2.44ms | - | 1718MB | 705MB | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov5s.onnx) |
+| YOLOv5s | Post-training quantization | 640*640 | 36.0 | 7.4MB | - | - | 1.87ms | 736MB | 315MB | [config](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/post_training_quantization/pytorch_yolo_series) | - |
+| YOLOv5s | ACT quantization-aware training | 640*640 | **36.9** | 7.4MB | - | - | **1.87ms** | 736MB | 315MB | [config](./configs/yolov5s_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov5s_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov5s_quant_onnx.tar) |
| | | | | | | | | |
-| YOLOv6s | Base model | 640*640 | 42.4 | 65.9MB | 9.06ms | 2.90ms | - | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov6s.onnx) |
-| YOLOv6s | KL post-training quantization | 640*640 | 30.3 | 16.8MB | - | - | 1.83ms | [config](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/post_training_quantization/pytorch_yolo_series) | - |
-| YOLOv6s | Quantization-aware distillation training | 640*640 | **41.3** | 16.8MB | - | - | **1.83ms** | [config](./configs/yolov6s_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov6s_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov6s_quant.onnx) |
+| YOLOv6s | Base model | 640*640 | 42.4 | 65.9MB | 9.06ms | 2.90ms | - | 1208MB | 555MB | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov6s.onnx) |
+| YOLOv6s | KL post-training quantization | 640*640 | 30.3 | 16.8MB | - | - | 1.83ms | 736MB | 315MB | [config](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/post_training_quantization/pytorch_yolo_series) | - |
+| YOLOv6s | Quantization-aware distillation training | 640*640 | **41.3** | 16.8MB | - | - | **1.83ms** | 736MB | 315MB | [config](./configs/yolov6s_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov6s_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov6s_quant_onnx.tar) |
| | | | | | | | | |
-| YOLOv7 | Base model | 640*640 | 51.1 | 141MB | 26.84ms | 7.44ms | - | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov7.onnx) |
-| YOLOv7 | Post-training quantization | 640*640 | 50.2 | 36MB | - | - | 4.55ms | [config](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/post_training_quantization/pytorch_yolo_series) | - |
-| YOLOv7 | ACT quantization-aware training | 640*640 | **50.9** | 36MB | - | - | **4.55ms** | [config](./configs/yolov7_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_quant.onnx) |
+| YOLOv7 | Base model | 640*640 | 51.1 | 141MB | 26.84ms | 7.44ms | - | 1722MB | 917MB | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov7.onnx) |
+| YOLOv7 | Post-training quantization | 640*640 | 50.2 | 36MB | - | - | 4.55ms | 827MB | 363MB | [config](https://github.com/PaddlePaddle/PaddleSlim/tree/develop/example/post_training_quantization/pytorch_yolo_series) | - |
+| YOLOv7 | ACT quantization-aware training | 640*640 | **50.9** | 36MB | - | - | **4.55ms** | 827MB | 363MB | [config](./configs/yolov7_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_quant_onnx.tar) |
| | | | | | | | | |
-| YOLOv7-Tiny | Base model | 640*640 | 37.3 | 24MB | 5.06ms | 2.32ms | - | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov7-tiny.onnx) |
-| YOLOv7-Tiny | Post-training quantization | 640*640 | 35.8 | 6.1MB | - | - | 1.68ms | - | - |
-| YOLOv7-Tiny | ACT quantization-aware training | 640*640 | **37.0** | 6.1MB | - | - | **1.68ms** | [config](./configs/yolov7_tiny_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_tiny_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_tiny_quant.onnx) |
+| YOLOv7-Tiny | Base model | 640*640 | 37.3 | 24MB | 5.06ms | 2.32ms | - | 738MB | 349MB | - | [Model](https://paddle-slim-models.bj.bcebos.com/act/yolov7-tiny.onnx) |
+| YOLOv7-Tiny | Post-training quantization | 640*640 | 35.8 | 6.1MB | - | - | 1.68ms | 729MB | 315MB | - | - |
+| YOLOv7-Tiny | ACT quantization-aware training | 640*640 | **37.0** | 6.1MB | - | - | **1.68ms** | 729MB | 315MB | [config](./configs/yolov7_tiny_qat_dis.yaml) | [Infer Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_tiny_quant.tar) | [ONNX Model](https://bj.bcebos.com/v1/paddle-slim-models/act/yolov7_tiny_quant_onnx.tar) |

Notes:
- All mAP values are measured on the COCO val2017 dataset.
@@ -136,13 +136,27 @@ python eval.py --config_path=./configs/yolov7_tiny_qat_dis.yaml
## 4. Prediction and Deployment

After auto-compression finishes, the following files are generated:

```shell
├── model.pdiparams         # Paddle inference model weights
├── model.pdmodel           # Paddle inference model file
├── calibration_table.txt   # calibration table of the quantized Paddle model
├── ONNX
│   ├── quant_model.onnx      # quantized model exported to ONNX
│   ├── calibration.cache     # calibration table that TensorRT can load directly
```
#### Export to ONNX and deploy with TensorRT

-After auto-compression finishes, an ONNX model file `quant_model.onnx` is generated in `save_dir` by default and can be verified directly with the TensorRT test script.
+Load `quant_model.onnx` together with `calibration.cache`; they can be verified directly with the TensorRT test script. See [./TensorRT] for the full code.

Run the test:

```shell
-python yolov7_onnx_trt.py --model_path=output/quant_model.onnx --image_file=images/000000570688.jpg --precision=int8
+cd TensorRT
+python trt_eval.py --onnx_model_file=output/ONNX/quant_model.onnx \
+       --calibration_file=output/ONNX/calibration.cache \
+       --image_file=../images/000000570688.jpg \
+       --precision_mode=int8
```
#### Paddle-TensorRT deployment
example/auto_compression/pytorch_yolo_series/TensorRT/README.md (new file, mode 100644)
# TensorRT Python Inference

### Evaluate COCO mAP

- FP16

```shell
python trt_eval.py --onnx_model_file=yolov7_tiny_quant_onnx/yolov7-tiny.onnx \
       --precision_mode=fp16 \
       --dataset_dir=dataset/coco/ \
       --val_image_dir=val2017 \
       --val_anno_path=annotations/instances_val2017.json
```

- INT8

```shell
python trt_eval.py --onnx_model_file=yolov7_tiny_quant_onnx/yolov7_tiny_quant.onnx \
       --calibration_file=yolov7_tiny_quant_onnx/calibration.cache \
       --precision_mode=int8 \
       --dataset_dir=dataset/coco/ \
       --val_image_dir=val2017 \
       --val_anno_path=annotations/instances_val2017.json
```

### Evaluate a single image

- FP16

```shell
python trt_eval.py --onnx_model_file=yolov7-tiny.onnx --image_file=../images/000000570688.jpg --precision_mode=fp16
```

- INT8

```shell
python trt_eval.py --onnx_model_file=yolov7_tiny_quant_onnx/yolov7_tiny_quant.onnx \
       --calibration_file=yolov7_tiny_quant_onnx/calibration.cache \
       --image_file=../images/000000570688.jpg \
       --precision_mode=int8
```

### FAQ

- When measuring host and GPU memory usage, note that the first run converts the ONNX model into a TRT engine, so its numbers are not representative; run trt_eval.py a second time to obtain the real host and GPU memory usage.
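For programmatic use, here is a minimal, hedged sketch of driving the same backend through the `TrtEngine` class added in `trt_backend.py`; the model and cache paths are the illustrative ones used above, and the zero-filled input only stands in for a properly preprocessed image batch:

```python
import numpy as np
import trt_backend  # backend module added in this commit

# Build an INT8 engine from the quantized ONNX model plus its calibration cache.
engine = trt_backend.TrtEngine(
    "yolov7_tiny_quant_onnx/yolov7_tiny_quant.onnx",
    max_batch_size=1,
    precision_mode="int8",
    calibration_cache_file="yolov7_tiny_quant_onnx/calibration.cache")

# Run inference on one NCHW float32 batch (replace with a real preprocessed image).
dummy_input = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = engine.infer([dummy_input])
print([out.shape for out in outputs])
```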
example/auto_compression/pytorch_yolo_series/TensorRT/trt_backend.py (new file, mode 100644)
```python
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import sys
import os
import copy
import numpy as np

EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
EXPLICIT_PRECISION = 1 << (int)(
    trt.NetworkDefinitionCreationFlag.EXPLICIT_PRECISION)


class LoadCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, cache_file="calibration.cache"):
        super().__init__()
        self.cache_file = cache_file

    def get_batch_size(self):
        return 1

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                print("Using calibration cache to save time: {:}".format(
                    self.cache_file))
                return f.read()


def remove_initializer_from_input(ori_model):
    model = copy.deepcopy(ori_model)
    if model.ir_version < 4:
        print(
            'Model with ir_version below 4 requires to include initilizer in graph input'
        )
        return

    inputs = model.graph.input
    name_to_input = {}
    for input in inputs:
        name_to_input[input.name] = input

    for initializer in model.graph.initializer:
        if initializer.name in name_to_input:
            inputs.remove(name_to_input[initializer.name])
    return model


# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem
        if host_mem:
            self.nbytes = host_mem.nbytes
        else:
            self.nbytes = 0

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()


class TrtEngine:
    def __init__(self,
                 onnx_model_file,
                 shape_info=None,
                 max_batch_size=None,
                 precision_mode="fp32",
                 engine_file_path=None,
                 calibration_cache_file="calibration.cache",
                 verbose=False):
        self.max_batch_size = 1 if max_batch_size is None else max_batch_size
        precision_mode = precision_mode.lower()
        assert precision_mode in [
            "fp32", "fp16", "int8"
        ], "precision_mode must be fp32, fp16 or int8, but your precision_mode is: {}".format(
            precision_mode)
        use_int8 = precision_mode == "int8"
        use_fp16 = precision_mode == "fp16"
        TRT_LOGGER = trt.Logger()
        if verbose:
            TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
        if engine_file_path is not None and os.path.exists(engine_file_path):
            # If a serialized engine exists, use it instead of building an engine.
            print("[TRT Backend] Reading engine from file {}".format(
                engine_file_path))
            with open(engine_file_path, "rb") as f, trt.Runtime(
                    TRT_LOGGER) as runtime:
                self.engine = runtime.deserialize_cuda_engine(f.read())
        else:
            builder = trt.Builder(TRT_LOGGER)
            config = builder.create_builder_config()
            network = None
            if use_int8 and not builder.platform_has_fast_int8:
                print("[TRT Backend] INT8 not supported on this platform.")
            if use_fp16 and not builder.platform_has_fast_fp16:
                print("[TRT Backend] FP16 not supported on this platform.")
            if use_int8 and builder.platform_has_fast_int8:
                print("[TRT Backend] Use INT8.")
                network = builder.create_network(EXPLICIT_BATCH |
                                                 EXPLICIT_PRECISION)
                config.int8_calibrator = LoadCalibrator(calibration_cache_file)
                config.set_flag(trt.BuilderFlag.INT8)
            elif use_fp16 and builder.platform_has_fast_fp16:
                print("[TRT Backend] Use FP16.")
                network = builder.create_network(EXPLICIT_BATCH)
                config.set_flag(trt.BuilderFlag.FP16)
            else:
                print("[TRT Backend] Use FP32.")
                network = builder.create_network(EXPLICIT_BATCH)
            parser = trt.OnnxParser(network, TRT_LOGGER)
            runtime = trt.Runtime(TRT_LOGGER)
            config.max_workspace_size = 1 << 28

            import onnx
            print("[TRT Backend] Loading ONNX model ...")
            onnx_model = onnx_model_file
            if not isinstance(onnx_model_file, onnx.ModelProto):
                onnx_model = onnx.load(onnx_model_file)
            onnx_model = remove_initializer_from_input(onnx_model)
            if not parser.parse(onnx_model.SerializeToString()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
                raise Exception("ERROR: Failed to parse the ONNX file.")

            if shape_info is None:
                builder.max_batch_size = 1
                for i in range(len(onnx_model.graph.input)):
                    input_shape = [
                        x.dim_value
                        for x in onnx_model.graph.input[0].type.tensor_type.shape.dim
                    ]
                    for s in input_shape:
                        assert s > 0, "In static shape mode, the input of onnx model should be fixed, but now it's {}".format(
                            onnx_model.graph.input[i])
            else:
                max_batch_size = 1
                if shape_info is not None:
                    assert len(
                        shape_info
                    ) == network.num_inputs, "Length of shape_info: {} is not same with length of model input: {}".format(
                        len(shape_info), network.num_inputs)
                    profile = builder.create_optimization_profile()
                    for k, v in shape_info.items():
                        if v[2][0] > max_batch_size:
                            max_batch_size = v[2][0]
                        print("[TRT Backend] optimize shape: ", k, v[0], v[1],
                              v[2])
                        profile.set_shape(k, v[0], v[1], v[2])
                    config.add_optimization_profile(profile)
                if max_batch_size > self.max_batch_size:
                    self.max_batch_size = max_batch_size
                builder.max_batch_size = self.max_batch_size

            print("[TRT Backend] Completed parsing of ONNX file.")
            print(
                "[TRT Backend] Building an engine from onnx model may take a while..."
            )
            plan = builder.build_serialized_network(network, config)
            print("[TRT Backend] Start Creating Engine.")
            self.engine = runtime.deserialize_cuda_engine(plan)
            print("[TRT Backend] Completed Creating Engine.")
            if engine_file_path is not None:
                with open(engine_file_path, "wb") as f:
                    f.write(self.engine.serialize())

        self.context = self.engine.create_execution_context()
        if shape_info is not None:
            self.context.active_optimization_profile = 0
        self.stream = cuda.Stream()
        self.bindings = []
        self.inputs = []
        self.outputs = []
        for binding in self.engine:
            self.bindings.append(0)
            if self.engine.binding_is_input(binding):
                self.inputs.append(HostDeviceMem(None, None))
            else:
                self.outputs.append(HostDeviceMem(None, None))
        print("[TRT Backend] Completed TrtEngine init ...")

    def infer(self, input_data):
        assert len(self.inputs) == len(
            input_data
        ), "Length of input_data: {} is not same with length of input: {}".format(
            len(input_data), len(self.inputs))
        self.allocate_buffers(input_data)
        return self.do_inference_v2(
            self.context,
            bindings=self.bindings,
            inputs=self.inputs,
            outputs=self.outputs,
            stream=self.stream)

    def do_inference_v2(self, context, bindings, inputs, outputs, stream):
        # Transfer input data to the GPU.
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
        # Run inference.
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        return [out.host for out in outputs]

    def allocate_buffers(self, input_data):
        input_idx = 0
        output_idx = 0
        for binding in self.engine:
            idx = self.engine.get_binding_index(binding)
            if self.engine.binding_is_input(binding):
                if not input_data[input_idx].flags['C_CONTIGUOUS']:
                    input_data[input_idx] = np.ascontiguousarray(
                        input_data[input_idx])
                self.context.set_binding_shape(idx,
                                               (input_data[input_idx].shape))
                self.inputs[input_idx].host = input_data[input_idx]
                nbytes = input_data[input_idx].nbytes
                if self.inputs[input_idx].nbytes < nbytes:
                    self.inputs[input_idx].nbytes = nbytes
                    self.inputs[input_idx].device = cuda.mem_alloc(nbytes)
                    self.bindings[idx] = int(self.inputs[input_idx].device)
                input_idx += 1
            else:
                dtype = trt.nptype(self.engine.get_binding_dtype(binding))
                shape = self.context.get_binding_shape(idx)
                self.outputs[output_idx].host = np.ascontiguousarray(
                    np.empty(shape, dtype=dtype))
                nbytes = self.outputs[output_idx].host.nbytes
                if self.outputs[output_idx].nbytes < nbytes:
                    self.outputs[output_idx].nbytes = nbytes
                    self.outputs[output_idx].device = cuda.mem_alloc(
                        self.outputs[output_idx].host.nbytes)
                    self.bindings[idx] = int(self.outputs[output_idx].device)
                output_idx += 1
```
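The constructor above also accepts a `shape_info` dict for dynamic-shape engines and an `engine_file_path` for caching the serialized engine. A minimal sketch is shown below; the input tensor name is illustrative only and must match the actual input name of your ONNX model:

```python
import trt_backend

# Hypothetical input name; replace with the real input name of the ONNX graph.
shape_info = {
    "images": [
        [1, 3, 640, 640],  # min shape
        [1, 3, 640, 640],  # optimal shape
        [4, 3, 640, 640],  # max shape; its batch dim also raises max_batch_size
    ]
}

engine = trt_backend.TrtEngine(
    "output/ONNX/quant_model.onnx",
    shape_info=shape_info,
    precision_mode="int8",
    calibration_cache_file="output/ONNX/calibration.cache",
    engine_file_path="quant_model_int8.trt")  # engine is serialized here and reused on later runs
```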
example/auto_compression/pytorch_yolo_series/onnx_trt_infer.py → example/auto_compression/pytorch_yolo_series/TensorRT/trt_eval.py
```diff
@@ -12,19 +12,68 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
-import numpy as np
-import cv2
 import tensorrt as trt
 import pycuda.driver as cuda
 import pycuda.autoinit
+import warnings
+warnings.filterwarnings("ignore")
+import os
+import time
+import random
+import sys
+import numpy as np
+import argparse
+from tqdm import tqdm
+import pkg_resources as pkg
+import time
+import cv2
+import paddle
+import onnx
+
+sys.path.append("../")
+from post_process import YOLOPostProcess, coco_metric
+from dataset import COCOValDataset
+import trt_backend
+
+
+def argsparser():
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        '--onnx_model_file',
+        type=str,
+        default='yolov7_tiny_quant_onnx/quant_model.onnx',
+        help="onnx model file path.")
+    parser.add_argument(
+        '--calibration_file',
+        type=str,
+        default='yolov7_tiny_quant_onnx/calibration.cache',
+        help="quant onnx model calibration cache file.")
+    parser.add_argument(
+        '--image_file',
+        type=str,
+        default=None,
+        help="image path, if set image_file, it will not eval coco.")
+    parser.add_argument(
+        '--dataset_dir',
+        type=str,
+        default='dataset/coco',
+        help="COCO dataset dir.")
+    parser.add_argument(
+        '--val_image_dir',
+        type=str,
+        default='val2017',
+        help="COCO dataset val image dir.")
+    parser.add_argument(
+        '--val_anno_path',
+        type=str,
+        default='annotations/instances_val2017.json',
+        help="COCO dataset anno path.")
+    parser.add_argument(
+        '--precision_mode',
+        type=str,
+        default='fp32',
+        help="support fp32/fp16/int8.")
+    parser.add_argument(
+        '--batch_size', type=int, default=1, help="Batch size of model input.")
+    return parser
+
 EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
 EXPLICIT_PRECISION = 1 << (int)(
     trt.NetworkDefinitionCreationFlag.EXPLICIT_PRECISION)

 # load coco labels
 CLASS_LABEL = [
```
```diff
@@ -68,72 +117,6 @@ def preprocess(image, input_size, mean=None, std=None, swap=(2, 0, 1)):
     return padded_img, r


-def postprocess(predictions, ratio):
-    boxes = predictions[:, :4]
-    scores = predictions[:, 4:5] * predictions[:, 5:]
-    boxes_xyxy = np.ones_like(boxes)
-    boxes_xyxy[:, 0] = boxes[:, 0] - boxes[:, 2] / 2.
-    boxes_xyxy[:, 1] = boxes[:, 1] - boxes[:, 3] / 2.
-    boxes_xyxy[:, 2] = boxes[:, 0] + boxes[:, 2] / 2.
-    boxes_xyxy[:, 3] = boxes[:, 1] + boxes[:, 3] / 2.
-    boxes_xyxy /= ratio
-    dets = multiclass_nms(boxes_xyxy, scores, nms_thr=0.45, score_thr=0.1)
-    return dets
-
-
-def nms(boxes, scores, nms_thr):
-    """Single class NMS implemented in Numpy."""
-    x1 = boxes[:, 0]
-    y1 = boxes[:, 1]
-    x2 = boxes[:, 2]
-    y2 = boxes[:, 3]
-    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
-    order = scores.argsort()[::-1]
-    keep = []
-    while order.size > 0:
-        i = order[0]
-        keep.append(i)
-        xx1 = np.maximum(x1[i], x1[order[1:]])
-        yy1 = np.maximum(y1[i], y1[order[1:]])
-        xx2 = np.minimum(x2[i], x2[order[1:]])
-        yy2 = np.minimum(y2[i], y2[order[1:]])
-        w = np.maximum(0.0, xx2 - xx1 + 1)
-        h = np.maximum(0.0, yy2 - yy1 + 1)
-        inter = w * h
-        ovr = inter / (areas[i] + areas[order[1:]] - inter)
-        inds = np.where(ovr <= nms_thr)[0]
-        order = order[inds + 1]
-    return keep
-
-
-def multiclass_nms(boxes, scores, nms_thr, score_thr):
-    """Multiclass NMS implemented in Numpy"""
-    final_dets = []
-    num_classes = scores.shape[1]
-    for cls_ind in range(num_classes):
-        cls_scores = scores[:, cls_ind]
-        valid_score_mask = cls_scores > score_thr
-        if valid_score_mask.sum() == 0:
-            continue
-        else:
-            valid_scores = cls_scores[valid_score_mask]
-            valid_boxes = boxes[valid_score_mask]
-            keep = nms(valid_boxes, valid_scores, nms_thr)
-            if len(keep) > 0:
-                cls_inds = np.ones((len(keep), 1)) * cls_ind
-                dets = np.concatenate(
-                    [valid_boxes[keep], valid_scores[keep, None], cls_inds], 1)
-                final_dets.append(dets)
-    if len(final_dets) == 0:
-        return None
-    return np.concatenate(final_dets, 0)
-
-
 def get_color_map_list(num_classes):
     color_map = num_classes * [0, 0, 0]
     for i in range(0, num_classes):
```
```diff
@@ -181,198 +164,158 @@ def draw_box(img, boxes, scores, cls_ids, conf=0.5, class_names=None):
     return img


-def get_engine(precision, model_file_path):
-    # TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
-    TRT_LOGGER = trt.Logger()
-    builder = trt.Builder(TRT_LOGGER)
-    config = builder.create_builder_config()
-    if precision == 'int8':
-        network = builder.create_network(EXPLICIT_BATCH | EXPLICIT_PRECISION)
+def get_current_memory_mb():
+    """
+    It is used to Obtain the memory usage of the CPU and GPU during the running of the program.
+    And this function Current program is time-consuming.
+    """
+    try:
+        pkg.require('pynvml')
+    except:
+        from pip._internal import main
+        main(['install', 'pynvml'])
+    try:
+        pkg.require('psutil')
+    except:
+        from pip._internal import main
+        main(['install', 'psutil'])
+    try:
+        pkg.require('GPUtil')
+    except:
+        from pip._internal import main
+        main(['install', 'GPUtil'])
+    import pynvml
+    import psutil
+    import GPUtil
+
+    gpu_id = int(os.environ.get('CUDA_VISIBLE_DEVICES', 0))
+    pid = os.getpid()
+    p = psutil.Process(pid)
+    info = p.memory_full_info()
+    cpu_mem = info.uss / 1024. / 1024.
+    gpu_mem = 0
+    gpu_percent = 0
+    gpus = GPUtil.getGPUs()
+    if gpu_id is not None and len(gpus) > 0:
+        gpu_percent = gpus[gpu_id].load
+        pynvml.nvmlInit()
+        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
+        meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
+        gpu_mem = meminfo.used / 1024. / 1024.
+    return round(cpu_mem, 4), round(gpu_mem, 4)
+
+
+def load_trt_engine():
+    model = onnx.load(FLAGS.onnx_model_file)
+    model_name = os.path.split(FLAGS.onnx_model_file)[-1].rstrip('.onnx')
+    if FLAGS.precision_mode == "int8":
+        engine_file = "{}_quant_model.trt".format(model_name)
+        assert os.path.exists(FLAGS.calibration_file)
+        trt_engine = trt_backend.TrtEngine(
+            model,
+            max_batch_size=1,
+            precision_mode=FLAGS.precision_mode,
+            engine_file_path=engine_file,
+            calibration_cache_file=FLAGS.calibration_file)
-    else:
-        network = builder.create_network(EXPLICIT_BATCH)
-    parser = trt.OnnxParser(network, TRT_LOGGER)
-    runtime = trt.Runtime(TRT_LOGGER)
-    if model_file_path.endswith('.trt'):
-        # If a serialized engine exists, use it instead of building an engine.
-        print("Reading engine from file {}".format(model_file_path))
-        with open(model_file_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
-            engine = runtime.deserialize_cuda_engine(f.read())
-        for i in range(network.num_layers):
-            layer = network.get_layer(i)
-            print(i, layer.name)
-        return engine
-    else:
-        config.max_workspace_size = 1 << 30
-        if precision == "fp16":
-            if not builder.platform_has_fast_fp16:
-                print("FP16 is not supported natively on this platform/device")
-            else:
-                config.set_flag(trt.BuilderFlag.FP16)
-        elif precision == "int8":
-            if not builder.platform_has_fast_int8:
-                print("INT8 is not supported natively on this platform/device")
-            else:
-                if builder.platform_has_fast_fp16:
-                    # Also enable fp16, as some layers may be even more efficient in fp16 than int8
-                    config.set_flag(trt.BuilderFlag.FP16)
-                config.set_flag(trt.BuilderFlag.INT8)
-        builder.max_batch_size = 1
-        print('Loading ONNX file from path {}...'.format(model_file_path))
-        with open(model_file_path, 'rb') as model:
-            print('Beginning ONNX file parsing')
-            if not parser.parse(model.read()):
-                print('ERROR: Failed to parse the ONNX file.')
-                for error in range(parser.num_errors):
-                    print(parser.get_error(error))
-                return None
-        print('Completed parsing of ONNX file')
-        print('Building an engine from file {}; this may take a while...'.format(
-            model_file_path))
-        plan = builder.build_serialized_network(network, config)
-        engine = runtime.deserialize_cuda_engine(plan)
-        print("Completed creating Engine")
-        with open(model_file_path, "wb") as f:
-            f.write(engine.serialize())
-        for i in range(network.num_layers):
-            layer = network.get_layer(i)
-            print(i, layer.name)
-        return engine
-
-
-# Simple helper data class that's a little nicer to use than a 2-tuple.
-class HostDeviceMem(object):
-    def __init__(self, host_mem, device_mem):
-        self.host = host_mem
-        self.device = device_mem
-
-    def __str__(self):
-        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
-
-    def __repr__(self):
-        return self.__str__()
-
-
-def allocate_buffers(engine):
-    inputs = []
-    outputs = []
-    bindings = []
-    stream = cuda.Stream()
-    for binding in engine:
-        size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
-        dtype = trt.nptype(engine.get_binding_dtype(binding))
-        # Allocate host and device buffers
-        host_mem = cuda.pagelocked_empty(size, dtype)
-        device_mem = cuda.mem_alloc(host_mem.nbytes)
-        # Append the device buffer to device bindings.
-        bindings.append(int(device_mem))
-        # Append to the appropriate list.
-        if engine.binding_is_input(binding):
-            inputs.append(HostDeviceMem(host_mem, device_mem))
-        else:
-            outputs.append(HostDeviceMem(host_mem, device_mem))
-    return inputs, outputs, bindings, stream
-
-
-def run_inference(context, bindings, inputs, outputs, stream):
-    # Transfer input data to the GPU.
-    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
-    # Run inference.
-    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
-    # Transfer predictions back from the GPU.
-    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
-    # Synchronize the stream
-    stream.synchronize()
-    # Return only the host outputs.
-    return [out.host for out in outputs]
-
-
-def main(args):
-    onnx_model = args.model_path
-    img_path = args.image_file
-    num_class = len(CLASS_LABEL)
-    repeat = 1000
-    engine = get_engine(args.precision, onnx_model)
-    model_all_names = []
-    for idx in range(engine.num_bindings):
-        is_input = engine.binding_is_input(idx)
-        name = engine.get_binding_name(idx)
-        op_type = engine.get_binding_dtype(idx)
-        model_all_names.append(name)
-        shape = engine.get_binding_shape(idx)
-        print('input id:', idx, ' is input: ', is_input, ' binding name:', name,
-              ' shape:', shape, 'type: ', op_type)
-    context = engine.create_execution_context()
-    print('Allocate buffers ...')
-    inputs, outputs, bindings, stream = allocate_buffers(engine)
-    print("TRT set input ...")
-    origin_img = cv2.imread(img_path)
-    input_shape = [args.img_shape, args.img_shape]
-    input_image, ratio = preprocess(origin_img, input_shape)
-    inputs[0].host = np.expand_dims(input_image, axis=0)
-    for _ in range(0, 50):
-        trt_outputs = run_inference(
-            context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
-    time1 = time.time()
+    else:
+        engine_file = "{}_{}_model.trt".format(model_name, FLAGS.precision_mode)
+        trt_engine = trt_backend.TrtEngine(
+            model,
+            max_batch_size=1,
+            precision_mode=FLAGS.precision_mode,
+            engine_file_path=engine_file)
+    return trt_engine
+
+
+def eval():
+    trt_engine = load_trt_engine()
+    bboxes_list, bbox_nums_list, image_id_list = [], [], []
+    cpu_mems, gpu_mems = 0, 0
+    sample_nums = len(val_loader)
+    with tqdm(
+            total=sample_nums,
+            bar_format='Evaluation stage, Run batch:|{bar}| {n_fmt}/{total_fmt}',
+            ncols=80) as t:
+        for data in val_loader:
+            data_all = {k: np.array(v) for k, v in data.items()}
+            outs = trt_engine.infer([data_all['image']])
+            outs = np.array(outs).reshape(1, -1, 85)
+            postprocess = YOLOPostProcess(
+                score_threshold=0.001, nms_threshold=0.65, multi_label=True)
+            res = postprocess(np.array(outs), data_all['scale_factor'])
+            bboxes_list.append(res['bbox'])
+            bbox_nums_list.append(res['bbox_num'])
+            image_id_list.append(np.array(data_all['im_id']))
+            cpu_mem, gpu_mem = get_current_memory_mb()
+            cpu_mems += cpu_mem
+            gpu_mems += gpu_mem
+            t.update()
+    print('Avg cpu_mem:{} MB, avg gpu_mem: {} MB'.format(
+        cpu_mems / sample_nums, gpu_mems / sample_nums))
+    coco_metric(anno_file, bboxes_list, bbox_nums_list, image_id_list)
+
+
+def infer():
+    origin_img = cv2.imread(FLAGS.image_file)
+    input_shape = [640, 640]
+    input_image, scale_factor = preprocess(origin_img, input_shape)
+    input_image = np.expand_dims(input_image, axis=0)
+    scale_factor = np.array([[scale_factor, scale_factor]])
+    trt_engine = load_trt_engine()
+    repeat = 100
+    cpu_mems, gpu_mems = 0, 0
-    for _ in range(0, repeat):
-        trt_outputs = run_inference(
-            context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
-    time2 = time.time()
-    # total time cost(ms)
-    total_inference_cost = (time2 - time1) * 1000
-    print("model path: ", onnx_model, " precision: ", args.precision)
-    print("In TensorRT, ",
-          "average latency is : {} ms".format(total_inference_cost / repeat))
+    for _ in range(0, repeat):
+        outs = trt_engine.infer(input_image)
+        cpu_mem, gpu_mem = get_current_memory_mb()
+        cpu_mems += cpu_mem
+        gpu_mems += gpu_mem
+    print('Avg cpu_mem:{} MB, avg gpu_mem: {} MB'.format(cpu_mems / repeat,
+                                                         gpu_mems / repeat))
     # Do postprocess
-    output = trt_outputs[0]
-    predictions = np.reshape(output, (1, -1, int(5 + num_class)))[0]
-    dets = postprocess(predictions, ratio)
+    outs = np.array(outs).reshape(1, -1, 85)
+    postprocess = YOLOPostProcess(
+        score_threshold=0.1, nms_threshold=0.45, multi_label=False)
+    res = postprocess(np.array(outs), scale_factor)
     # Draw rectangles and labels on the original image
+    dets = res['bbox']
     if dets is not None:
-        final_boxes, final_scores, final_cls_inds = dets[:, :4], dets[:, 4], dets[:, 5]
+        final_boxes, final_scores, final_class = dets[:, 2:], dets[:, 1], dets[:, 0]
-        origin_img = draw_box(
-            origin_img, final_boxes, final_scores, final_cls_inds,
-            conf=0.5, class_names=CLASS_LABEL)
+        origin_img = draw_box(
-            origin_img, final_boxes, final_scores, final_class,
+            conf=0.5, class_names=CLASS_LABEL)
     cv2.imwrite('output.jpg', origin_img)
     print('The prediction results are saved in output.jpg.')


-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        '--model_path', type=str, default="quant_model.onnx", help="inference model filepath")
-    parser.add_argument(
-        '--image_file', type=str, default="bus.jpg", help="image path")
-    parser.add_argument(
-        '--precision', type=str, default='fp32', help="support fp32/fp16/int8.")
-    parser.add_argument(
-        '--img_shape', type=int, default=640, help="input_size")
-    args = parser.parse_args()
-    main(args)
+def main():
+    if FLAGS.image_file:
+        infer()
+    else:
+        global val_loader
+        dataset = COCOValDataset(
+            dataset_dir=FLAGS.dataset_dir,
+            image_dir=FLAGS.val_image_dir,
+            anno_path=FLAGS.val_anno_path)
+        global anno_file
+        anno_file = dataset.ann_file
+        val_loader = paddle.io.DataLoader(
+            dataset, batch_size=FLAGS.batch_size, drop_last=True)
+        eval()
+
+
+if __name__ == '__main__':
+    paddle.enable_static()
+    parser = argsparser()
+    FLAGS = parser.parse_args()
+    paddle.set_device('cpu')
+    main()
```