Some profiling results on Transformer
Created by: kuke
See the profiling script in #1051.
Profiling on a single GPU:
Namespace(batch_size=3200, device='GPU', num_iters=10, opts=[], pool_size=200000, special_token=['<s>', '<e>', '<unk>'], src_vocab_fpath='data/vocab.bpe.32000', train_file_pattern='data/train.tok.clean.bpe.32000.en-de', trg_vocab_fpath='data/vocab.bpe.32000', use_token_batch=True)
Warming up ...
batch: 0, sum loss: 12268.457031, avg loss: 10.885942, ppl: 53420.105469
batch: 1, sum loss: 12178.395508, avg loss: 10.863868, ppl: 52253.792969
batch: 2, sum loss: 8879.164062, avg loss: 10.815060, ppl: 49764.625000
Profiling ...
batch: 0, sum loss: 12244.498047, avg loss: 10.864683, ppl: 52296.417969
batch: 1, sum loss: 12141.288086, avg loss: 10.830766, ppl: 50552.402344
batch: 2, sum loss: 8841.737305, avg loss: 10.769473, ppl: 47546.957031
batch: 3, sum loss: 9886.705078, avg loss: 10.781576, ppl: 48125.917969
batch: 4, sum loss: 14117.681641, avg loss: 10.818147, ppl: 49918.488281
batch: 5, sum loss: 12966.675781, avg loss: 10.760727, ppl: 47132.917969
batch: 6, sum loss: 13092.765625, avg loss: 10.749397, ppl: 46601.933594
batch: 7, sum loss: 11771.145508, avg loss: 10.681621, ppl: 43548.066406
batch: 8, sum loss: 13585.663086, avg loss: 10.680552, ppl: 43501.578125
batch: 9, sum loss: 14588.139648, avg loss: 10.632755, ppl: 41471.234375
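As a sanity check on the log above, the reported `ppl` column is consistent with being the exponential of the `avg loss` column. A minimal sketch (values copied from warm-up batch 0 above; this is just a check, not part of the profiling script):

```python
import math

# avg loss from warm-up batch 0 in the log above
avg_loss = 10.885942

# Perplexity is exp(average per-token loss)
ppl = math.exp(avg_loss)
print(ppl)  # close to the logged ppl 53420.105469 (small float32 drift aside)
```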
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::mul_grad 960 650.981 0.339968 1.57696 0.678106
thread0::matmul_grad 370 565.571 0.264192 53.889 1.52857
thread0::softmax_with_cross_entropy 10 469.709 45.5281 48.127 46.9709
thread0::dropout 500 374.273 0.579584 1.0496 0.748546
thread0::matmul 370 372.095 0.136192 34.4699 1.00566
thread0::mul 960 314.561 0.162816 0.748544 0.327668
thread0::layer_norm_grad 300 263.247 0.756736 0.930816 0.877489
thread0::elementwise_add_grad 740 153.038 0.050176 0.909312 0.206808
thread0::sum 320 127.341 0.17408 3.48467 0.397939
thread0::softmax_grad 180 127.295 0.503808 0.910336 0.707197
thread0::layer_norm 300 123.037 0.352256 0.436224 0.410122
thread0::adam 1810 88.9652 0.008192 2.20774 0.049152
thread0::elementwise_add 740 83.4826 0.054272 0.25088 0.112814
thread0::softmax 180 79.7266 0.326656 0.587776 0.442926
thread0::transpose 720 61.1564 0.070656 2.22003 0.0849395
thread0::transpose_grad 720 58.6763 0.069632 0.088064 0.0814948
thread0::softmax_with_cross_entropy_grad 10 53.2357 4.92954 5.63917 5.32357
thread0::label_smooth 10 50.7187 4.66534 5.3975 5.07187
thread0::dropout_grad 500 47.9016 0.077824 0.126976 0.0958033
thread0::relu_grad 120 41.0081 0.288768 0.364544 0.341734
thread0::relu 120 29.1768 0.205824 0.259072 0.24314
thread0::scale 420 24.8033 0.008192 0.06656 0.0590555
thread0::fill_zeros_like 1100 21.2132 0.008192 0.041984 0.0192847
thread0::one_hot 10 19.67 1.79712 2.08896 1.967
thread0::reshape 1110 15.6437 0.002048 0.063488 0.0140934
thread0::lookup_table_grad 40 13.7851 0.164864 0.53248 0.344627
thread0::lookup_table 40 4.0704 0.064512 0.181248 0.10176
thread0::reshape_grad 1110 3.08541 0.002048 0.004096 0.00277965
thread0::feed 190 1.96202 0.007168 0.034816 0.0103264
thread0::fetch 20 0.835584 0.033792 0.053248 0.0417792
thread0::reduce_sum 20 0.196608 0.009216 0.011264 0.0098304
thread0::elementwise_mul 10 0.094208 0.009216 0.01024 0.0094208
thread0::reduce_sum_grad 10 0.093184 0.009216 0.01024 0.0093184
thread0::elementwise_mul_grad 10 0.089088 0.008192 0.009216 0.0089088
thread0::elementwise_div 10 0.089088 0.008192 0.009216 0.0089088
Elapsed time: total 6.818273 s, in executor 5.527260 s
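One way to read the report: divide each op's total time by the in-executor time (5.527260 s) to get its share of execution. A small sketch with the top five ops copied from the table above (interpretation aid only, not part of the original script):

```python
# In-executor time from the last line of the report, in ms
executor_ms = 5527.260

# Total time per op (ms), copied from the top rows of the report
op_totals_ms = {
    'mul_grad': 650.981,
    'matmul_grad': 565.571,
    'softmax_with_cross_entropy': 469.709,
    'dropout': 374.273,
    'matmul': 372.095,
}

for op, total in op_totals_ms.items():
    share = total / executor_ms
    print(f'{op}: {share:.1%} of executor time')  # e.g. mul_grad: 11.8%
```

Note that `softmax_with_cross_entropy` reaches its share with only 10 calls (one per profiled batch), so its ~47 ms average per call stands out next to the sub-millisecond averages of the matmul/mul ops.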