Created by: Xreki
给动态图添加一些RecordEvent,以使用profiler计时。主要添加了TraceOp定义、OpBase run时间。
Profiler结果
- tracer_option设置为"Default"时
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::trace::elementwise_mul 295021 20699.9 20163.544405 (0.974088) 536.377183 (0.025912) 0.049803 19.7763 0.0701642 0.124442
elementwise_mul 295021 15792 15255.596223 (0.966035) 536.377183 (0.033965) 0.037162 19.7454 0.0535283 0.094937
thread0::trace::sigmoid 278760 17430.3 16952.911439 (0.972612) 477.388863 (0.027388) 0.044703 54.5046 0.062528 0.104786
sigmoid 278760 13471.3 12993.881955 (0.964562) 477.388863 (0.035438) 0.033885 19.3493 0.0483257 0.0809856
thread0::trace::elementwise_add 188163 15421 15008.685350 (0.973260) 412.360942 (0.026740) 0.051291 20.3084 0.0819558 0.0927071
elementwise_add 188163 11770.5 11358.157335 (0.964967) 412.360942 (0.035033) 0.036847 20.2535 0.0625549 0.0707612
thread0::trace::tanh 185840 11365.5 11057.260118 (0.972881) 308.225771 (0.027119) 0.045966 45.9233 0.0611574 0.0683262
tanh 185840 8775.66 8467.431791 (0.964877) 308.225771 (0.035123) 0.034759 45.8629 0.0472216 0.0527569
thread0::matmul_grad 95243 11283.3 7796.735016 (0.691000) 3486.524817 (0.309000) 0.081368 105.813 0.118468 0.0678319
thread0::trace::matmul 95243 10924 9377.359161 (0.858419) 1546.629708 (0.141581) 0.088672 19.7287 0.114696 0.065672
matmul 95243 9146.79 7600.156775 (0.830910) 1546.629708 (0.169090) 0.073666 19.7017 0.0960363 0.054988
thread0::elementwise_mul_grad 278760 10797.3 10208.462237 (0.945468) 588.792533 (0.054532) 0.028732 21.4931 0.0387332 0.0649102
thread0::sigmoid_grad 278760 9968.17 9533.539719 (0.956398) 434.632331 (0.043602) 0.027511 24.2642 0.035759 0.0599259
thread0::trace::concat 102212 8197.7 7971.810544 (0.972445) 225.887403 (0.027555) 0.048438 23.9326 0.0802029 0.0492823
concat 102212 6336.53 6110.640915 (0.964352) 225.887403 (0.035648) 0.036144 22.4009 0.061994 0.0380935
GpuMemcpyAsync:CPU->GPU 2323 99.5479 95.156308 (0.955885) 4.391546 (0.044115) 0.026608 10.0145 0.0428531 0.000598455
GpuMemcpyAsync(same_gpu):GPU->GPU 16261 362.299 340.398368 (0.939550) 21.901080 (0.060450) 0.014118 14.8761 0.0222803 0.00217805
thread0::trace::split 92920 7483.12 7285.324363 (0.973568) 197.797091 (0.026432) 0.056189 21.1272 0.0805329 0.0449865
split 92920 5841.18 5643.387099 (0.966138) 197.797091 (0.033862) 0.0423 21.0755 0.0628625 0.0351156
thread0::elementwise_add_grad 188163 7328.05 6867.405289 (0.937140) 460.642014 (0.062860) 0.027718 35.002 0.0389452 0.0440542
thread0::tanh_grad 185840 6553.21 6268.124259 (0.956497) 285.087087 (0.043503) 0.027526 18.5929 0.0352627 0.0393961
thread0::trace::uniform_random 7 4619.67 4619.022944 (0.999861) 0.642942 (0.000139) 0.109459 4617.29 659.952 0.0277722
uniform_random 7 4619.57 4618.923239 (0.999861) 0.642942 (0.000139) 0.100064 4617.25 659.938 0.0277716
thread0::trace::slice 55752 4332.32 4200.342893 (0.969536) 131.982003 (0.030464) 0.042287 18.5202 0.0777071 0.0260448
slice 55752 3344.56 3212.578064 (0.960538) 131.982003 (0.039462) 0.037432 18.4722 0.05999 0.0201066
thread0::concat_grad 95243 4141.89 3928.251315 (0.948420) 213.638211 (0.051580) 0.030779 16.5058 0.0434876 0.0248999
GpuMemcpyAsync:CPU->GPU 2323 56.4214 51.910886 (0.920057) 4.510519 (0.079943) 0.015668 9.19188 0.0242882 0.00033919
thread0::concat 92920 4021.44 3811.756727 (0.947860) 209.679474 (0.052140) 0.030169 18.4992 0.0432785 0.0241758
thread0::slice_grad 46460 2005.84 1879.338492 (0.936931) 126.506328 (0.063069) 0.033499 18.46 0.0431736 0.0120586
thread0::trace::reduce_sum 20907 1887.86 1767.050701 (0.936008) 120.807216 (0.063992) 0.048801 16.5262 0.0902979 0.0113493
reduce_sum 20907 1723.07 1602.262647 (0.929888) 120.807216 (0.070112) 0.043879 16.52 0.0824159 0.0103586
thread0::trace::reshape2 16261 1459.65 1432.694261 (0.981531) 26.958985 (0.018469) 0.053817 18.4844 0.0897641 0.00877504
reshape2 16261 1221.98 1195.020813 (0.977938) 26.958985 (0.022062) 0.046352 18.4596 0.0751479 0.00734621
GpuMemcpyAsync(same_gpu):GPU->GPU 16261 670.593 643.633849 (0.959798) 26.958985 (0.040202) 0.02053 18.4275 0.0412393 0.00403142
- tracer_option设置为"OpDetail"时
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::trace::elementwise_mul 295021 22224 21689.400541 (0.975943) 534.647137 (0.024057) 0.051376 47.4034 0.0753304 0.12384
elementwise_mul 295021 16866.1 16331.465778 (0.968301) 534.647137 (0.031699) 0.038085 47.3347 0.0571692 0.093984
elementwise_mul/mutable_var 295021 815.767 815.766571 (1.000000) 0.000000 (0.000000) 0.001909 15.3072 0.00276511 0.00454574
elementwise_mul/infer_shape 295021 1545.64 1545.637178 (1.000000) 0.000000 (0.000000) 0.003805 13.0704 0.00523908 0.00861284
elementwise_mul/compute 295021 9231.31 8696.660879 (0.942083) 534.647137 (0.057917) 0.018881 47.3015 0.0312903 0.0514401
thread0::trace::sigmoid 278760 18867.9 18390.807995 (0.974717) 477.043869 (0.025283) 0.045942 45.1106 0.0676849 0.105138
sigmoid 278760 14497.2 14020.118776 (0.967094) 477.043869 (0.032906) 0.034785 45.0921 0.0520059 0.0807833
sigmoid/mutable_var 278760 728.007 728.007330 (1.000000) 0.000000 (0.000000) 0.00185 12.3596 0.00261159 0.00405672
sigmoid/infer_shape 278760 818.173 818.173317 (1.000000) 0.000000 (0.000000) 0.002195 4.44803 0.00293505 0.00455915
sigmoid/compute 278760 8384.28 7907.235240 (0.943103) 477.043869 (0.056897) 0.018679 23.8468 0.0300771 0.0467202
thread0::trace::elementwise_add 188163 16498.3 16085.303700 (0.974968) 412.989688 (0.025032) 0.054022 25.675 0.0876809 0.0919343
elementwise_add 188163 12450.2 12037.222237 (0.966829) 412.989688 (0.033171) 0.038458 25.6021 0.0661672 0.069377
elementwise_add/mutable_var 188163 537.545 537.545205 (1.000000) 0.000000 (0.000000) 0.001911 15.0784 0.00285681 0.00299539
elementwise_add/infer_shape 188163 1129.68 1129.684713 (1.000000) 0.000000 (0.000000) 0.003661 17.2693 0.00600376 0.006295
elementwise_add/compute 188163 7323.43 6910.438996 (0.943607) 412.989688 (0.056393) 0.019223 24.7346 0.0389207 0.0408087
thread0::trace::tanh 185840 12511.6 12203.161753 (0.975351) 308.401579 (0.024649) 0.047614 48.9886 0.0673244 0.0697189
tanh 185840 9548.18 9239.782378 (0.967700) 308.401579 (0.032300) 0.036118 48.9272 0.0513785 0.0532059
tanh/mutable_var 185840 449.617 449.617061 (1.000000) 0.000000 (0.000000) 0.001855 4.49429 0.00241938 0.00250543
tanh/infer_shape 185840 564.351 564.351379 (1.000000) 0.000000 (0.000000) 0.002182 11.0229 0.00303676 0.00314477
tanh/compute 185840 5705.06 5396.660562 (0.945942) 308.401579 (0.054058) 0.019368 48.8987 0.0306988 0.0317906
thread0::matmul_grad 95243 12020.6 8535.450835 (0.710066) 3485.196014 (0.289934) 0.082694 91.1439 0.12621 0.0669833
thread0::matmul_grad/mutable_var 95243 245.919 245.919410 (1.000000) 0.000000 (0.000000) 0.001893 2.13686 0.00258202 0.00137035
thread0::matmul_grad/infer_shape 95243 448.776 448.775658 (1.000000) 0.000000 (0.000000) 0.003282 21.5141 0.0047119 0.00250074
thread0::matmul_grad/compute 95243 10316.5 6831.259899 (0.662171) 3485.196014 (0.337829) 0.068233 91.1095 0.108317 0.057487
thread0::trace::matmul 95243 11746.7 10191.921382 (0.867640) 1554.793710 (0.132360) 0.090045 22.4605 0.123334 0.0654569
matmul 95243 9798.42 8243.627965 (0.841322) 1554.793710 (0.158678) 0.075643 22.3954 0.102878 0.0546003
matmul/mutable_var 95243 276.706 276.705942 (1.000000) 0.000000 (0.000000) 0.001888 5.48664 0.00290526 0.0015419
matmul/infer_shape 95243 608.525 608.525453 (1.000000) 0.000000 (0.000000) 0.004591 11.1404 0.00638919 0.00339092
matmul/compute 95243 7764.74 6209.941411 (0.799762) 1554.793710 (0.200238) 0.058763 22.3688 0.0815255 0.0432679
thread0::elementwise_mul_grad 278760 11402.4 10812.645272 (0.948278) 589.760714 (0.051722) 0.028939 33.1241 0.040904 0.0635383
thread0::elementwise_mul_grad/mutable_var 278760 693.415 693.414540 (1.000000) 0.000000 (0.000000) 0.001777 12.6198 0.0024875 0.00386395
thread0::elementwise_mul_grad/infer_shape 278760 1130.91 1130.914327 (1.000000) 0.000000 (0.000000) 0.002773 12.5793 0.00405695 0.00630186
thread0::elementwise_mul_grad/compute 278760 6759.91 6170.152945 (0.912756) 589.760714 (0.087244) 0.016099 33.0981 0.0242499 0.0376686
thread0::sigmoid_grad 278760 10411.2 9976.661146 (0.958259) 434.573065 (0.041741) 0.027713 18.5712 0.0373484 0.0580151
thread0::sigmoid_grad/mutable_var 278760 675.252 675.251732 (1.000000) 0.000000 (0.000000) 0.001856 9.11445 0.00242234 0.00376274
thread0::sigmoid_grad/infer_shape 278760 826.263 826.262670 (1.000000) 0.000000 (0.000000) 0.002369 3.83187 0.00296406 0.00460423
thread0::sigmoid_grad/compute 278760 6145.62 5711.049729 (0.929287) 434.573065 (0.070713) 0.015307 18.5486 0.0220463 0.0342456
RecordEvent的开销
基于该PR,关闭profiler功能之后,ptb模型的耗时,因为机器多人共用,会有一些波动。
One epoch cost 93.4636731148