.. _inference_in_nvidia_gpu:

Testing Quantized Model Performance on NVIDIA GPUs
===================================================

MegEngine is deeply optimized for NVIDIA GPUs to ensure high-performance model inference, and it supports a wide range of NVIDIA hardware, such as the P4 and T4 commonly used in servers, as well as embedded devices like the Jetson TX2 and TX1.

Turing is the latest compute architecture released by NVIDIA. Turing chips introduce TensorCore int8 compute units, which can further accelerate int8 quantized models. Current Turing GPU models include the 2080Ti and the T4; if you are deploying deep learning inference on these platforms, you can use TensorCore for acceleration.

The rest of this document walks through the inference steps for a MegEngine quantized model on a 2080Ti, using the load_and_run tool (see :ref:`how_to_use_load_and_run`).

Overview
---------------------------------------------------

MegEngine provides automatic conversion tools for running int8 models on TensorCore. Users first need to prepare an int8 quantized model in NCHW format. MegEngine currently supports three ways to utilize TensorCore:

1. Via `TensorRT <https://developer.nvidia.com/tensorrt>`_ subgraphs
2. Via `cuDNN <https://developer.nvidia.com/cudnn>`_
3. Via an in-house algorithm based on the CHWN4 layout

Model Preparation
------------------------------------

Serialize the mge model and dump it to a file; we take `ResNet18 <https://github.com/MegEngine/Models/blob/master/official/quantization/models/resnet.py>`_ as an example.
Because MegEngine models are defined as dynamic graphs (see :ref:`dynamic_and_static_graph`), the model must first be converted to a static graph before it can be deployed.

*Code snippet:*

.. code-block:: python
   :linenos:

   import numpy as np

   import megengine.functional as f
   import megengine.hub
   from megengine.jit import trace

   if __name__ == '__main__':

       # Load the pretrained int8-quantized ResNet18 from the model hub.
       net = megengine.hub.load("megengine/models", "quantized_resnet18", pretrained=True)
       net.eval()

       # Trace the forward pass as a static graph so it can be dumped.
       @trace(symbolic=True)
       def fun(data, *, net):
           pred = net(data)
           pred_normalized = f.softmax(pred)
           return pred_normalized

       # Random input with batch size 128, matching the deployment batch size.
       data = np.random.random([128, 3, 224, 224]).astype(np.float32)

       # Run the trace once, then serialize the static graph for inference.
       fun.trace(data, net=net)
       fun.dump("resnet18.mge", arg_names=["data"], optimize_for_inference=True)

After running this script to convert the model, we obtain ``resnet18.mge``, a pretrained model file that can be loaded through the MegEngine C++ API. Note that we chose batchsize=128 here.
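
As a quick sanity check (a minimal sketch, not part of the deployment flow itself), the traced function can be executed once to confirm that the output shape matches the 1000-class ResNet18 head:

.. code-block:: python

   # Assumes `fun`, `data` and `net` from the script above are in scope.
   out = fun(data, net=net)
   print(out.shape)  # expected: (128, 1000), one softmax row per batch entry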


Input Preparation
---------------------------------------

load_and_run can feed input data to the model directly via the ``--input`` option, which supports several formats, including .ppm/.pgm/.json/.npy.

The test input image is shown below:

.. figure:: ./fig/cat.jpg

    Figure 1: Cat


Since the model expects float32 input in NCHW layout, the image needs to be converted to .npy format first.

.. code-block:: python
   :linenos:

   import cv2
   import numpy as np

   cat = cv2.imread('./cat.jpg')
   cat = cat[np.newaxis]  # expand shape from (224, 224, 3) to (1, 224, 224, 3)
   cat = np.transpose(cat, (0, 3, 1, 2))  # NHWC -> NCHW
   cat = np.repeat(cat, 128, axis=0)  # repeat along batch to (128, 3, 224, 224)

   np.save('cat.npy', np.float32(cat))
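
A short check like the following (a sketch based on the step above) can catch shape or dtype mistakes before they surface inside load_and_run:

.. code-block:: python

   import numpy as np

   arr = np.load('cat.npy')
   # load_and_run feeds this tensor to the `data` input of resnet18.mge,
   # so it must be float32 with the traced shape (128, 3, 224, 224).
   assert arr.dtype == np.float32
   assert arr.shape == (128, 3, 224, 224)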

Building load_and_run
-------------------------------------

See :ref:`how_to_use_load_and_run` for details.


Via TensorRT Subgraphs
-------------------------------------

NVIDIA `TensorRT <https://developer.nvidia.com/tensorrt>`_ is a high-performance deep learning inference library.
MegEngine integrates TensorRT at the subgraph level:
when a model is loaded, a graph-optimization pass traverses the whole graph, identifies the operators that TensorRT can execute, and groups them into connected subgraphs, each of which is converted into a TensorRT operator.
At runtime, TensorRT is invoked automatically to execute these operators.

Because the current TensorRT subgraph optimization pass was developed for the NCHW4 layout, an NCHW network additionally needs the ``--enable-nchw4`` option: the network is first converted to NCHW4 and then into TensorRT subgraphs.
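
For intuition, the NCHW4 relayout can be sketched in NumPy as below. This is an illustration only, assuming NCHW4 packs groups of 4 consecutive channels into the innermost dimension; the real conversion happens inside MegEngine:

.. code-block:: python

   import numpy as np

   n, c, h, w = 128, 8, 224, 224
   x = np.zeros((n, c, h, w), dtype=np.int8)  # an int8 NCHW tensor

   # (N, C, H, W) -> (N, C//4, 4, H, W) -> (N, C//4, H, W, 4):
   # each group of 4 consecutive channels becomes the innermost vector.
   x_nchw4 = x.reshape(n, c // 4, 4, h, w).transpose(0, 1, 3, 4, 2)
   print(x_nchw4.shape)  # (128, 2, 224, 224, 4)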

All of the experiments below run with fastrun enabled; see :ref:`how_to_use_load_and_run` for the details of how fastrun works.


.. code-block:: bash

    ./load_and_run ./resnet18.mge --input ./cat.npy --enable-nchw4 --tensorrt --fast-run
    mgb load-and-run: using MegBrain 8.4.1(0) and MegDNN 9.3.0
    [03 21:26:59 from_argv@mgblar.cpp:1167][WARN] enable nchw4 optimization
    [03 21:26:59 from_argv@mgblar.cpp:1143][WARN] use tensorrt mode
    load model: 4264.378ms
    [03 21:27:03 operator()@opr_replace.cpp:729][WARN] Both operands of Elemwise are newly prepared. This is rare. Please check. opr=ADD(multi_dv[0]:o41,reshape[1592])[1594] inputs=0={id:42, layout:{1(1000),1000(1)}, Float32, owner:multi_dv[0]{MultipleDeviceTensorHolder}, name:multi_dv[0]:o41, slot:41, gpu0:0, s, 2, 1} 1={id:1593, shape:{128,1000}, Float32, owner:reshape(matrix_mul[1585])[1592]{Reshape}, name:reshape(matrix_mul[1585])[1592], slot:0, gpu0:0, s, 4, 8}
    [03 21:27:03 operator()@opr_replace.cpp:729][WARN] Both operands of Elemwise are newly prepared. This is rare. Please check. opr=SUB(ADD[1594],reduce4[1596])[1599] inputs=0={id:1595, shape:{128,1000}, Float32, owner:ADD(multi_dv[0]:o41,reshape[1592])[1594]{Elemwise}, name:ADD(multi_dv[0]:o41,reshape[1592])[1594], slot:0, gpu0:0, s, 4, 8} 1={id:1597, shape:{128,1}, Float32, owner:reduce4(ADD[1594])[1596]{Reduce}, name:reduce4(ADD[1594])[1596], slot:0, gpu0:0, s, 4, 8}
    [03 21:27:03 operator()@opr_replace.cpp:729][WARN] Both operands of Elemwise are newly prepared. This is rare. Please check. opr=TRUE_DIV(EXP[1601],reduce0[1603])[1606] inputs=0={id:1602, shape:{128,1000}, Float32, owner:EXP(SUB[1599])[1601]{Elemwise}, name:EXP(SUB[1599])[1601], slot:0, gpu0:0, s, 4, 8} 1={id:1604, shape:{128,1}, Float32, owner:reduce0(EXP[1601])[1603]{Reduce}, name:reduce0(EXP[1601])[1603], slot:0, gpu0:0, s, 4, 8}
    [03 21:27:16 get_output_var_shape@tensorrt_opr.cpp:549][WARN] TensorRTOpr(name:tensor_rt(relayout_format[419])[2500]) engine build time 13010.89 ms
    [03 21:27:16 get_output_var_shape@tensorrt_opr.cpp:549][WARN] TensorRTOpr(name:tensor_rt(reshape[2537])[2539]) engine build time 17.50 ms
    [03 21:27:16 get_output_var_shape@tensorrt_opr.cpp:549][WARN] TensorRTOpr(name:tensor_rt(multi_dv[0]:o41)[2548]) engine build time 14.38 ms
    [03 21:27:16 get_output_var_shape@tensorrt_opr.cpp:549][WARN] TensorRTOpr(name:tensor_rt(tensor_rt[2548])[2554]) engine build time 23.57 ms
    [03 21:27:16 get_output_var_shape@tensorrt_opr.cpp:549][WARN] TensorRTOpr(name:tensor_rt(tensor_rt[2554])[2560]) engine build time 15.49 ms
    === prepare: 13211.884ms; going to warmup
    warmup 0: 32.548ms
    === going to run input for 10 times
    iter 0/10: 7.592ms (exec=0.320,device=7.540)
    iter 1/10: 7.023ms (exec=0.282,device=6.993)
    iter 2/10: 5.804ms (exec=0.300,device=5.773)
    iter 3/10: 5.721ms (exec=0.275,device=5.691)
    iter 4/10: 5.728ms (exec=0.282,device=5.697)
    iter 5/10: 5.824ms (exec=0.270,device=5.794)
    iter 6/10: 5.845ms (exec=0.278,device=5.816)
    iter 7/10: 6.031ms (exec=0.277,device=6.004)
    iter 8/10: 6.042ms (exec=0.275,device=6.013)
    iter 9/10: 6.046ms (exec=0.276,device=6.019)
    === finished test #0: time=61.656ms avg_time=6.166ms sd=0.629ms minmax=5.721,7.592


Via cuDNN
-----------------------------------------

`cuDNN <https://developer.nvidia.com/cudnn>`_ is NVIDIA's library of deep learning primitives for GPUs, providing many highly optimized operators such as forward convolution, backward convolution, and pooling. To make full use of TensorCore, cuDNN defines the `NC/32HW32 <https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#nc32hw32-layout-x32>`_ layout, so our original NCHW network must be converted to that layout before cuDNN's operators can be called.
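
The NC/32HW32 relayout follows the same vectorization pattern as NCHW4, just with 32 channels in the innermost dimension. Again a NumPy sketch for illustration, assuming C is a multiple of 32 (it would be padded otherwise):

.. code-block:: python

   import numpy as np

   n, c, h, w = 128, 64, 224, 224
   x = np.zeros((n, c, h, w), dtype=np.int8)

   # (N, C, H, W) -> (N, C//32, H, W, 32): 32-channel vectors innermost.
   x_nc32hw32 = x.reshape(n, c // 32, 32, h, w).transpose(0, 1, 3, 4, 2)
   print(x_nc32hw32.shape)  # (128, 2, 224, 224, 32)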

load_and_run enables this layout conversion with the ``--enable-nchw32`` option.

.. code-block:: bash

    ./load_and_run ./resnet18.mge --input ./cat.npy --enable-nchw32 --fast-run
    mgb load-and-run: using MegBrain 8.4.1(0) and MegDNN 9.3.0
    [03 21:28:21 from_argv@mgblar.cpp:1171][WARN] enable nchw32 optimization
    load model: 4307.589ms
    === prepare: 93.419ms; going to warmup
    [03 21:28:25 invoke@system.cpp:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
    [03 21:28:25 invoke@system.cpp:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
    [03 21:28:25 invoke@system.cpp:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
    warmup 0: 137.616ms
    === going to run input for 10 times
    iter 0/10: 9.873ms (exec=1.768,device=9.778)
    iter 1/10: 9.809ms (exec=1.662,device=9.776)
    iter 2/10: 9.806ms (exec=1.678,device=9.771)
    iter 3/10: 9.804ms (exec=1.625,device=9.773)
    iter 4/10: 9.801ms (exec=1.654,device=9.770)
    iter 5/10: 9.810ms (exec=1.609,device=9.775)
    iter 6/10: 9.800ms (exec=1.630,device=9.768)
    iter 7/10: 8.226ms (exec=1.600,device=8.195)
    iter 8/10: 7.754ms (exec=1.613,device=7.723)
    iter 9/10: 7.687ms (exec=1.619,device=7.655)
    === finished test #0: time=92.370ms avg_time=9.237ms sd=0.941ms minmax=7.687,9.873


Via the In-house CHWN4 Layout
-----------------------------------------

Besides the two approaches above, which rely on NVIDIA SDKs to accelerate inference on CUDA, MegEngine has developed its own TensorCore algorithms based on a CHWN4 layout. This layout was designed mainly for MegEngine's custom or non-standard operators (such as BatchConv and GroupLocal), but it also supports standard convolution. Because this format gives storage priority to the batch dimension, it delivers good operator performance on GPUs when the batch size is large.
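
As with the layouts above, a NumPy sketch can illustrate the idea. The exact in-memory order here is an assumption for illustration: channels vectorized by 4, with the batch dimension placed next to the innermost 4-channel vector:

.. code-block:: python

   import numpy as np

   n, c, h, w = 128, 8, 224, 224
   x = np.zeros((n, c, h, w), dtype=np.int8)

   # (N, C, H, W) -> (C//4, H, W, N, 4): the batch dimension sits next to
   # the innermost 4-channel vector, so a large batch yields long
   # contiguous runs of data for each (channel-group, h, w) position.
   x_chwn4 = x.reshape(n, c // 4, 4, h, w).transpose(1, 3, 4, 0, 2)
   print(x_chwn4.shape)  # (2, 224, 224, 128, 4)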

Enabling it works the same way: just pass ``--enable-chwn4``.

.. code-block:: bash

    ./load_and_run ./resnet18.mge --input ./cat.npy --enable-chwn4 --fast-run
    mgb load-and-run: using MegBrain 8.4.1(0) and MegDNN 9.3.0
    [03 21:29:20 from_argv@mgblar.cpp:1168][WARN] enable chwn4 optimization
    load model: 4269.923ms
    === prepare: 85.530ms; going to warmup
    [03 21:29:24 invoke@system.cpp:492][ERR] timeout is set, but no fork_exec_impl not given; timeout would be ignored
    ....
    warmup 0: 226.736ms
    === going to run input for 10 times
    iter 0/10: 11.131ms (exec=0.429,device=11.039)
    iter 1/10: 11.117ms (exec=0.365,device=11.086)
    iter 2/10: 11.069ms (exec=0.342,device=11.032)
    iter 3/10: 11.084ms (exec=0.355,device=11.045)
    iter 4/10: 11.070ms (exec=0.362,device=11.037)
    iter 5/10: 11.057ms (exec=0.337,device=11.021)
    iter 6/10: 11.075ms (exec=0.365,device=11.039)
    iter 7/10: 11.060ms (exec=0.343,device=11.028)
    iter 8/10: 11.069ms (exec=0.340,device=11.038)
    iter 9/10: 11.056ms (exec=0.331,device=11.021)
    === finished test #0: time=110.788ms avg_time=11.079ms sd=0.025ms minmax=11.056,11.131
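
For reference, the three runs above on the same 2080Ti at batchsize=128 give the following average times (copied from the logs; absolute numbers will vary with driver, CUDA/cuDNN/TensorRT versions, and hardware):

======================  =========
Approach                avg_time
======================  =========
TensorRT subgraphs      6.166ms
cuDNN (NCHW32)          9.237ms
In-house CHWN4          11.079ms
======================  =========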