Created by: Xreki
PR types
Performance optimization
PR changes
OPs
Describe
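The target output shape of the fill_constant op can be set through the input variable ShapeTensor. The data in ShapeTensor is consumed on the CPU, so there is no need to apply a data transform to ShapeTensor. This PR overrides the GetKernelTypeForVar function to skip the data transform for ShapeTensor and ShapeTensorList. When ShapeTensor already resides on the CPU, this avoids both the CPU -> GPU data transform and the subsequent GPU -> CPU copy.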
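The mechanism can be illustrated with a small Python model. This is a minimal sketch, not Paddle code: `KernelType`, `kernel_type_for_var` and `needs_data_transform` are hypothetical names, and the comparison is reduced to dtype and place. It assumes the framework transforms an input only when the kernel type reported for that input differs from the kernel type the op expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelType:
    """Simplified stand-in for an op kernel type (hypothetical)."""
    dtype: str
    place: str  # "CPU" or "GPU"


# Inputs whose data is only read on the CPU, so they are never transformed.
NO_TRANSFORM_INPUTS = {"ShapeTensor", "ShapeTensorList"}


def kernel_type_for_var(var_name, tensor_place, expected):
    """Model of an overridden GetKernelTypeForVar.

    For the shape inputs, report the expected kernel type unchanged so the
    framework sees no mismatch. For every other input, report the place
    where the tensor actually lives.
    """
    if var_name in NO_TRANSFORM_INPUTS:
        return expected
    return KernelType(expected.dtype, tensor_place)


def needs_data_transform(var_name, tensor_place, expected):
    """An input is transformed only if its kernel type differs from expected."""
    return kernel_type_for_var(var_name, tensor_place, expected) != expected


expected = KernelType("float32", "GPU")
print(needs_data_transform("ShapeTensor", "CPU", expected))  # shape input
print(needs_data_transform("X", "CPU", expected))            # ordinary input
```

In this model a CPU-resident ShapeTensor is left where it is, while an ordinary CPU input to a GPU kernel would still be copied to the device.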
Taking a simple network built from abs as an example, the profile on develop is as follows:
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| thread0::fill_constant | 1000 | 2163.77 | 1488.538387 (0.687937) | 675.233993 (0.312063) | 2.11586 | 2.95418 | 2.16377 | 0.377412 |
| thread0::fill_constant/prepare_data | 1000 | 1417.8 | 1415.906315 (0.998666) | 1.892017 (0.001334) | 1.36487 | 2.1164 | 1.4178 | 0.247297 |
| GpuMemcpySync:CPU->GPU | 1000 | 30.8863 | 28.994266 (0.938742) | 1.892017 (0.061258) | 0.026714 | 0.775044 | 0.0308863 | 0.00538729 |
| thread0::fill_constant/compute | 1000 | 733.565 | 60.222613 (0.082096) | 673.341976 (0.917904) | 0.718384 | 0.813246 | 0.733565 | 0.127951 |
| GpuMemcpySync:GPU->CPU | 1000 | 30.1841 | 28.165868 (0.933135) | 2.018263 (0.066865) | 0.028091 | 0.100898 | 0.0301841 | 0.00526482 |
| thread0::fill_constant/infer_shape | 1000 | 4.1711 | 4.171098 (1.000000) | 0.000000 (0.000000) | 0.003393 | 0.037167 | 0.0041711 | 0.000727537 |
| thread0::abs_grad | 1000 | 2100.57 | 33.059464 (0.015738) | 2067.514875 (0.984262) | 2.06689 | 2.17703 | 2.10057 | 0.366389 |
| thread0::abs_grad/compute | 1000 | 2085.56 | 18.044247 (0.008652) | 2067.514875 (0.991348) | 2.05328 | 2.14258 | 2.08556 | 0.36377 |
| thread0::abs_grad/infer_shape | 1000 | 3.64797 | 3.647974 (1.000000) | 0.000000 (0.000000) | 0.003097 | 0.021624 | 0.00364797 | 0.000636292 |
| thread0::abs_grad/prepare_data | 1000 | 2.08001 | 2.080013 (1.000000) | 0.000000 (0.000000) | 0.001869 | 0.011697 | 0.00208001 | 0.000362803 |
| thread0::abs | 1000 | 1449.36 | 42.104390 (0.029050) | 1407.252035 (0.970950) | 1.43082 | 5.3383 | 1.44936 | 0.252802 |
| thread0::abs/compute | 1000 | 1434.47 | 27.213455 (0.018971) | 1407.252035 (0.981029) | 1.41723 | 5.27658 | 1.43447 | 0.250204 |
| thread0::abs/infer_shape | 1000 | 3.39102 | 3.391025 (1.000000) | 0.000000 (0.000000) | 0.002799 | 0.016075 | 0.00339102 | 0.000591474 |
| thread0::abs/prepare_data | 1000 | 2.41072 | 2.410721 (1.000000) | 0.000000 (0.000000) | 0.00205 | 0.027976 | 0.00241072 | 0.000420486 |
| thread0::shape | 1000 | 19.4742 | 19.474245 (1.000000) | 0.000000 (0.000000) | 0.017236 | 0.084414 | 0.0194742 | 0.00339676 |
| thread0::shape/compute | 1000 | 6.05756 | 6.057555 (1.000000) | 0.000000 (0.000000) | 0.00504 | 0.046566 | 0.00605756 | 0.00105658 |
| thread0::shape/infer_shape | 1000 | 3.08725 | 3.087250 (1.000000) | 0.000000 (0.000000) | 0.002631 | 0.014997 | 0.00308725 | 0.000538489 |
| thread0::shape/prepare_data | 1000 | 2.05583 | 2.055828 (1.000000) | 0.000000 (0.000000) | 0.001763 | 0.01555 | 0.00205583 | 0.000358584 |
With the optimization in this PR, the profile is as follows:
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| thread0::abs_grad | 1000 | 2094.88 | 25.081476 (0.011973) | 2069.803478 (0.988027) | 2.04439 | 2.25984 | 2.09488 | 0.493975 |
| thread0::abs_grad/compute | 1000 | 2087.18 | 17.374182 (0.008324) | 2069.803478 (0.991676) | 2.0375 | 2.25199 | 2.08718 | 0.492158 |
| thread0::abs_grad/infer_shape | 1000 | 1.92794 | 1.927938 (1.000000) | 0.000000 (0.000000) | 0.001504 | 0.015205 | 0.00192794 | 0.000454609 |
| thread0::abs_grad/prepare_data | 1000 | 1.11049 | 1.110490 (1.000000) | 0.000000 (0.000000) | 0.000887 | 0.015823 | 0.00111049 | 0.000261854 |
| thread0::abs | 1000 | 1431.24 | 44.346750 (0.030985) | 1386.891605 (0.969015) | 1.40744 | 5.55335 | 1.43124 | 0.337487 |
| thread0::abs/compute | 1000 | 1421.63 | 34.739583 (0.024436) | 1386.891605 (0.975564) | 1.39997 | 5.49127 | 1.42163 | 0.335222 |
| thread0::abs/infer_shape | 1000 | 2.65737 | 2.657372 (1.000000) | 0.000000 (0.000000) | 0.001122 | 0.018938 | 0.00265737 | 0.00062661 |
| thread0::abs/prepare_data | 1000 | 1.74957 | 1.749568 (1.000000) | 0.000000 (0.000000) | 0.001216 | 0.01942 | 0.00174957 | 0.000412549 |
| thread0::fill_constant | 1000 | 701.393 | 29.554290 (0.042137) | 671.839210 (0.957863) | 0.685444 | 0.904966 | 0.701393 | 0.165389 |
| thread0::fill_constant/compute | 1000 | 693.483 | 21.643423 (0.031210) | 671.839210 (0.968790) | 0.679312 | 0.896107 | 0.693483 | 0.163524 |
| thread0::fill_constant/infer_shape | 1000 | 2.91289 | 2.912890 (1.000000) | 0.000000 (0.000000) | 0.001764 | 0.03279 | 0.00291289 | 0.000686861 |
| thread0::fill_constant/prepare_data | 1000 | 1.06425 | 1.064252 (1.000000) | 0.000000 (0.000000) | 0.000827 | 0.012405 | 0.00106425 | 0.000250951 |
| thread0::shape | 1000 | 13.3539 | 13.353900 (1.000000) | 0.000000 (0.000000) | 0.009428 | 0.09379 | 0.0133539 | 0.00314886 |
| thread0::shape/compute | 1000 | 5.24565 | 5.245645 (1.000000) | 0.000000 (0.000000) | 0.003475 | 0.05904 | 0.00524565 | 0.00123693 |
| thread0::shape/infer_shape | 1000 | 2.45793 | 2.457927 (1.000000) | 0.000000 (0.000000) | 0.001659 | 0.018921 | 0.00245793 | 0.000579581 |
| thread0::shape/prepare_data | 1000 | 2.13448 | 2.134481 (1.000000) | 0.000000 (0.000000) | 0.000912 | 0.027866 | 0.00213448 | 0.000503312 |
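To make the comparison concrete, the short script below works out the fill_constant improvement from the two profiles. The numbers are copied from the Total column of the fill_constant and fill_constant/prepare_data rows; the variable names are illustrative, and times are left in whatever unit the profiler reported.

```python
# Totals over 1000 calls, copied from the two profiles (profiler time unit).
CALLS = 1000
fill_constant_before = 2163.77
fill_constant_after = 701.393
prepare_data_before = 1417.8
prepare_data_after = 1.06425

# Overall speedup of the fill_constant op.
speedup = fill_constant_before / fill_constant_after

# Average time saved per fill_constant call.
saved_per_call = (fill_constant_before - fill_constant_after) / CALLS

# Share of the original fill_constant time that was spent in prepare_data.
prepare_share_before = prepare_data_before / fill_constant_before

print(f"speedup: {speedup:.2f}x")
print(f"saved per call: {saved_per_call:.3f}")
print(f"prepare_data share before: {prepare_share_before:.1%}")
print(f"prepare_data total: {prepare_data_before} -> {prepare_data_after}")
```

The fill_constant total drops by roughly a factor of three, almost entirely because prepare_data no longer dominates. The GpuMemcpySync:CPU->GPU and GpuMemcpySync:GPU->CPU events also no longer appear in the second profile, which is consistent with the transform being skipped.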