Operator profiling of the Transformer model.
Created by: peterzhang2029
Operator profiling results (1 pass):
GPU: TITAN X (Pascal, 12 GB global memory)
-------------------------> Profiling Report <-------------------------
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::mul_grad 43941 13300.3 0.171008 5.76614 0.302686
thread0::layer_norm_grad 13590 12576.8 0.786432 58.5759 0.925442
thread0::layer_norm 13590 9943.62 0.661504 0.884736 0.731687
thread0::mul 43941 6476.36 0.0768 3.70176 0.147388
thread0::softmax 453 3779.52 7.69248 9.76282 8.3433
thread0::matmul_grad 16308 3431.95 0.145408 0.321536 0.210446
thread0::elementwise_add_grad 33522 2000.51 0.004096 0.2816 0.0596776
thread0::adam 82899 1963.99 0.003072 0.421888 0.0236914
thread0::sum 22197 1780.81 0.01024 0.611328 0.0802277
thread0::matmul 16308 1413.82 0.034816 3.33517 0.0866951
thread0::transpose 32616 1371.65 0.028672 2.69722 0.0420544
thread0::transpose_grad 32616 1370.53 0.028672 0.070656 0.0420202
thread0::elementwise_add 33522 1045.33 0.007168 24.4326 0.0311834
thread0::dropout_grad 22650 644.103 0.009216 0.060416 0.0284372
thread0::softmax_grad 453 611.858 0.900096 2.48218 1.35068
thread0::dropout 22650 600.103 0.006144 3.95059 0.0264946
thread0::scale 17214 395.886 0.003072 0.043008 0.0229979
thread0::relu_grad 5436 355.256 0.04608 0.105472 0.0653524
thread0::fill_zeros_like 49830 311.703 0.002048 0.026624 0.00625532
thread0::elementwise_mul 83352 282.403 0.002976 0.048128 0.00338808
thread0::relu 5436 258.603 0.031744 10.8749 0.0475723
thread0::elementwise_div_grad 8154 249.431 0.02048 0.062464 0.0305901
thread0::reduce_sum_grad 8607 245.62 0.003072 0.070656 0.0285372
thread0::lookup_table_grad 1812 191.795 0.043008 0.264192 0.105847
thread0::reduce_sum 8607 136.825 0.004096 0.070656 0.015897
thread0::exp_grad 8154 117.174 0.007168 0.043008 0.0143702
thread0::elementwise_div 8154 96.2836 0.007168 0.046976 0.0118081
thread0::cross_entropy_grad 453 86.9844 0.135168 0.362496 0.192019
thread0::exp 8154 76.608 0.004096 0.03072 0.00939514
thread0::lookup_table 1812 73.7901 0.023552 0.106496 0.040723
thread0::reshape 33975 66.6537 0.001024 3.75808 0.00196184
thread0::reshape_grad 33975 63.0589 0.001024 0.01536 0.00185604
thread0::cross_entropy 453 2.26896 0.004096 0.013312 0.00500874
thread0::elementwise_mul_grad 453 1.62397 0.003072 0.014336 0.00358492
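The columns of the report can be cross-checked against each other: Total should equal Calls x Ave. up to print rounding. The following is a minimal sketch of that check, not part of the original profiling run. The helper name parse_row is hypothetical, and only the top five rows of the CUDA report are copied in. The divisor 453 is an assumption: it is the smallest call count in the report and divides the counts shown here evenly, which suggests (but does not confirm) 453 iterations in the pass.

```python
def parse_row(line):
    """Split one report row into (event, calls, total, min, max, ave)."""
    name, calls, total, lo, hi, ave = line.split()
    # Drop the "thread0::" prefix so only the operator name remains.
    return (name.split("::")[-1], int(calls), float(total),
            float(lo), float(hi), float(ave))


# Top five rows copied verbatim from the CUDA report above.
GPU_ROWS = """\
thread0::mul_grad 43941 13300.3 0.171008 5.76614 0.302686
thread0::layer_norm_grad 13590 12576.8 0.786432 58.5759 0.925442
thread0::layer_norm 13590 9943.62 0.661504 0.884736 0.731687
thread0::mul 43941 6476.36 0.0768 3.70176 0.147388
thread0::softmax 453 3779.52 7.69248 9.76282 8.3433
"""

# Assumption: 453 is the number of iterations in the pass (see lead-in).
ASSUMED_ITERATIONS = 453

rows = [parse_row(line) for line in GPU_ROWS.splitlines()]
for name, calls, total, lo, hi, ave in rows:
    # Relative gap between the printed Total and Calls * Ave.
    rel_gap = abs(calls * ave - total) / total
    per_iter = calls / ASSUMED_ITERATIONS
    print(f"{name:16s} calls/iter={per_iter:6.1f} "
          f"total={total:9.1f} ms rel_gap={rel_gap:.1e}")
```

The same parse_row helper works on the CPU report, since both reports share the column layout Event, Calls, Total, Min., Max., Ave.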
CPU: single thread
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::mul_grad 43941 1.37484e+06 14.1704 934.205 31.2884
thread0::transpose_grad 32616 1.02324e+06 20.8854 122.053 31.3724
thread0::transpose 32616 1.0213e+06 20.8805 88.0507 31.3129
thread0::mul 43941 670698 7.00583 390.276 15.2636
thread0::softmax 453 462634 696.503 1808.74 1021.27
thread0::layer_norm_grad 13590 375420 18.1468 95.6174 27.6247
thread0::adam 82899 318720 0.012018 177.504 3.84468
thread0::layer_norm 13590 207723 10.2002 47.4344 15.285
thread0::reduce_sum_grad 8607 205651 0.016318 131.23 23.8935
thread0::softmax_grad 453 148503 222.244 589.22 327.821
thread0::dropout 22650 105028 1.1015 19.8472 4.637
thread0::matmul_grad 16308 49075.7 1.43275 21.8065 3.0093
thread0::elementwise_add 33522 34885.7 0.172053 33.2651 1.04068
thread0::sum 22197 33028.7 0.203313 22.527 1.48798
thread0::elementwise_add_grad 33522 30067.1 0.109021 16.5107 0.896935
thread0::relu_grad 5436 23078.8 2.31625 20.2764 4.24554
thread0::dropout_grad 22650 20590.6 0.169235 8.56589 0.909077
thread0::matmul 16308 20140.8 0.559672 8.58716 1.23502
thread0::elementwise_div_grad 8154 12677.3 0.598271 9.90769 1.55473
thread0::fill_zeros_like 49830 12354.7 0.002088 9.29022 0.247938
thread0::scale 17214 11808.7 0.001299 5.37838 0.685996
thread0::relu 5436 7071.9 0.691306 14.388 1.30094
thread0::lookup_table_grad 1812 5678.67 0.146436 21.6215 3.13392
thread0::cross_entropy_grad 453 5495.68 5.93957 38.3655 12.1317
thread0::reduce_sum 8607 5443.33 0.004744 4.33712 0.632431
thread0::exp 8154 3926.4 0.191484 3.60489 0.48153
thread0::elementwise_div 8154 3443.53 0.1601 4.35008 0.422312
thread0::exp_grad 8154 3120.79 0.124565 3.20178 0.382731
thread0::lookup_table 1812 1550.32 0.497613 2.62074 0.855583
thread0::elementwise_mul 83352 335.148 0.002086 0.041859 0.00402087
thread0::reshape_grad 33975 308.679 0.003699 2.49831 0.00908548
thread0::reshape 33975 114.659 0.001462 2.63728 0.00337482
thread0::cross_entropy 453 91.1172 0.110631 0.720848 0.201142
thread0::elementwise_mul_grad 453 4.59584 0.007498 0.025791 0.0101453
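To compare the two reports, the CPU total of an operator can be divided by its GPU total. The following is a minimal sketch of that comparison for a handful of operators, with the Total values copied from the two reports above. The function name cpu_over_gpu and the choice of operators are illustrative only. The ratio compares a single CPU thread against the whole GPU, so it should be read as a rough indication rather than a benchmark.

```python
# (GPU total ms, CPU total ms), copied from the Total column of each report.
TOTALS_MS = {
    "mul_grad": (13300.3, 1.37484e+06),
    "transpose": (1371.65, 1.0213e+06),
    "transpose_grad": (1370.53, 1.02324e+06),
    "softmax": (3779.52, 462634),
    "layer_norm_grad": (12576.8, 375420),
    "matmul": (1413.82, 20140.8),
}


def cpu_over_gpu(totals):
    """Return {op: CPU total / GPU total}, largest ratio first."""
    ratios = {op: cpu / gpu for op, (gpu, cpu) in totals.items()}
    return dict(sorted(ratios.items(), key=lambda kv: kv[1], reverse=True))


ratios = cpu_over_gpu(TOTALS_MS)
for op, ratio in ratios.items():
    print(f"{op:16s} {ratio:8.1f}x")
```

Among these operators, transpose and transpose_grad show by far the largest ratio: they rank second and third on the CPU but only eleventh and twelfth on the GPU.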