Profiling results on a single GPU device
Created by: kuke
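The profiling script is in #636.
Device: Tesla K40m (12GB)
Conclusion: The LSTMP layer dominates the run time. Its backward pass (lstmp_grad, 9387.12 ms in total) is the single most expensive operation, followed by its forward pass (lstmp, 4344.11 ms).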
----------- Configuration Arguments -----------
batch_size: 32
device: GPU
feature_lst: data/feature.lst
first_batches_to_skip: 1
hidden_dim: 1024
label_lst: data/label.lst
learning_rate: 0.002
max_batch_num: 10
mean_var: data/global_mean_var_search26kHr
parallel: False
print_train_acc: False
proj_dim: 512
sorted_key: total
stacked_num: 5
------------------------------------------------
..........
Time consumed: 18.386745 s, performance: 3199.098050 frames/s.
-------------------------> Profiling Report <-------------------------
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event                          Calls  Total     Min.      Max.      Ave.
thread0::lstmp_grad            45     9387.12   203.389   213.289   208.603
thread0::lstmp                 45     4344.11   94.435    98.2327   96.5359
thread0::mul_grad              54     1233.86   8.95155   42.5825   22.8492
thread0::mul                   54     599.213   4.5631    20.7512   11.0965
thread0::batch_norm_grad       54     570.634   8.05872   19.7179   10.5673
thread0::batch_norm            54     505.514   7.44077   17.1559   9.36137
thread0::sequence_conv_grad    9      441.255   45.7588   52.7421   49.0283
thread0::sequence_conv         9      232.747   24.8565   27.555    25.8607
thread0::elementwise_add_grad  63     105.817   0.516352  2.19734   1.67963
thread0::elementwise_add       63     89.0073   0.371936  2.18442   1.41281
thread0::adam                  369    55.4231   0.00704   0.808288  0.150198
thread0::softmax               9      44.3238   4.69805   5.15946   4.92487
thread0::softmax_grad          9      19.5171   2.03376   2.31587   2.16857
thread0::sigmoid_grad          54     17.3561   0.25632   0.58752   0.32141
thread0::sigmoid               54     12.0755   0.180064  0.403328  0.22362
thread0::top_k                 9      8.62765   0.901312  1.02582   0.958628
thread0::mean                  9      5.47344   0.571712  0.644768  0.60816
thread0::elementwise_mul       369    2.65594   0.005856  0.008512  0.00719766
thread0::cross_entropy_grad    9      2.55389   0.260096  0.311232  0.283765
thread0::fill_constant         378    2.31142   0.005216  0.01536   0.00611488
thread0::fill_zeros_like       216    1.52355   0.005504  0.012224  0.00705348
thread0::fetch                 18     0.783648  0.025696  0.072096  0.043536
thread0::accuracy              9      0.422208  0.046432  0.048224  0.046912
thread0::feed                  18     0.192256  0.006144  0.02      0.0106809
thread0::mean_grad             9      0.13536   0.013824  0.01536   0.01504
thread0::cross_entropy         9      0.125536  0.012672  0.014688  0.0139484
thread0::scale                 18     0.110144  0.005536  0.007744  0.00611911
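
To put the conclusion in numbers, the report rows can be turned into per-event shares of the total time. The sketch below is not part of the profiling script in #636. It is a small post-processing example that assumes the events in thread0 do not overlap, so their totals can simply be summed. The helper names (`parse_report`, `time_shares`) are made up for illustration. Under that assumption, lstmp and lstmp_grad together account for roughly three quarters of the summed event time.

```python
# Rows copied verbatim from the profiling report above.
# Columns: event, calls, total, min, max, average (all times in ms).
REPORT = """\
thread0::lstmp_grad 45 9387.12 203.389 213.289 208.603
thread0::lstmp 45 4344.11 94.435 98.2327 96.5359
thread0::mul_grad 54 1233.86 8.95155 42.5825 22.8492
thread0::mul 54 599.213 4.5631 20.7512 11.0965
thread0::batch_norm_grad 54 570.634 8.05872 19.7179 10.5673
thread0::batch_norm 54 505.514 7.44077 17.1559 9.36137
thread0::sequence_conv_grad 9 441.255 45.7588 52.7421 49.0283
thread0::sequence_conv 9 232.747 24.8565 27.555 25.8607
thread0::elementwise_add_grad 63 105.817 0.516352 2.19734 1.67963
thread0::elementwise_add 63 89.0073 0.371936 2.18442 1.41281
thread0::adam 369 55.4231 0.00704 0.808288 0.150198
thread0::softmax 9 44.3238 4.69805 5.15946 4.92487
thread0::softmax_grad 9 19.5171 2.03376 2.31587 2.16857
thread0::sigmoid_grad 54 17.3561 0.25632 0.58752 0.32141
thread0::sigmoid 54 12.0755 0.180064 0.403328 0.22362
thread0::top_k 9 8.62765 0.901312 1.02582 0.958628
thread0::mean 9 5.47344 0.571712 0.644768 0.60816
thread0::elementwise_mul 369 2.65594 0.005856 0.008512 0.00719766
thread0::cross_entropy_grad 9 2.55389 0.260096 0.311232 0.283765
thread0::fill_constant 378 2.31142 0.005216 0.01536 0.00611488
thread0::fill_zeros_like 216 1.52355 0.005504 0.012224 0.00705348
thread0::fetch 18 0.783648 0.025696 0.072096 0.043536
thread0::accuracy 9 0.422208 0.046432 0.048224 0.046912
thread0::feed 18 0.192256 0.006144 0.02 0.0106809
thread0::mean_grad 9 0.13536 0.013824 0.01536 0.01504
thread0::cross_entropy 9 0.125536 0.012672 0.014688 0.0139484
thread0::scale 18 0.110144 0.005536 0.007744 0.00611911
"""


def parse_report(text):
    """Parse whitespace-separated report rows into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        name, calls, total, min_ms, max_ms, ave_ms = line.split()
        rows.append({
            # Drop the "thread0::" prefix to keep only the operator name.
            "event": name.split("::", 1)[1],
            "calls": int(calls),
            "total_ms": float(total),
            "min_ms": float(min_ms),
            "max_ms": float(max_ms),
            "ave_ms": float(ave_ms),
        })
    return rows


def time_shares(rows):
    """Return each event's fraction of the summed total time.

    Assumption: events do not overlap, so summing totals is meaningful.
    """
    grand_total = sum(r["total_ms"] for r in rows)
    return {r["event"]: r["total_ms"] / grand_total for r in rows}


rows = parse_report(REPORT)
shares = time_shares(rows)
lstmp_share = shares["lstmp"] + shares["lstmp_grad"]

if __name__ == "__main__":
    # Print the five most expensive events with their share of the total.
    for event, share in sorted(shares.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{event:<20s} {share:6.1%}")
    print(f"lstmp + lstmp_grad   {lstmp_share:6.1%}")
```

The same calculation works for any later report in this format, which makes it easy to check whether an optimization of the LSTMP kernels actually shifts the distribution.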