Profiling result on single GPU device (#637) · Issue · PaddlePaddle / models

Profiling result on single GPU device

Created by: kuke

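See the profiling script in #636.

Device: Tesla K40m (12GB)

Conclusion: The LSTMP layer dominates the run time, and its backward pass (lstmp_grad) is the single most expensive operator. Together, lstmp_grad (9387.12 ms) and lstmp (4344.11 ms) account for roughly 78% of the operator time listed in the report below, and the backward pass costs about twice as much as the forward pass.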

-----------  Configuration Arguments -----------
batch_size: 32
device: GPU
feature_lst: data/feature.lst
first_batches_to_skip: 1
hidden_dim: 1024
label_lst: data/label.lst
learning_rate: 0.002
max_batch_num: 10
mean_var: data/global_mean_var_search26kHr
parallel: False
print_train_acc: False
proj_dim: 512
sorted_key: total
stacked_num: 5
------------------------------------------------
..........
Time consumed: 18.386745 s, performance: 3199.098050 frames/s.

------------------------->     Profiling Report     <-------------------------

Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread

Event                            Calls       Total       Min.        Max.        Ave.
thread0::lstmp_grad              45          9387.12     203.389     213.289     208.603
thread0::lstmp                   45          4344.11     94.435      98.2327     96.5359
thread0::mul_grad                54          1233.86     8.95155     42.5825     22.8492
thread0::mul                     54          599.213     4.5631      20.7512     11.0965
thread0::batch_norm_grad         54          570.634     8.05872     19.7179     10.5673
thread0::batch_norm              54          505.514     7.44077     17.1559     9.36137
thread0::sequence_conv_grad      9           441.255     45.7588     52.7421     49.0283
thread0::sequence_conv           9           232.747     24.8565     27.555      25.8607
thread0::elementwise_add_grad    63          105.817     0.516352    2.19734     1.67963
thread0::elementwise_add         63          89.0073     0.371936    2.18442     1.41281
thread0::adam                    369         55.4231     0.00704     0.808288    0.150198
thread0::softmax                 9           44.3238     4.69805     5.15946     4.92487
thread0::softmax_grad            9           19.5171     2.03376     2.31587     2.16857
thread0::sigmoid_grad            54          17.3561     0.25632     0.58752     0.32141
thread0::sigmoid                 54          12.0755     0.180064    0.403328    0.22362
thread0::top_k                   9           8.62765     0.901312    1.02582     0.958628
thread0::mean                    9           5.47344     0.571712    0.644768    0.60816
thread0::elementwise_mul         369         2.65594     0.005856    0.008512    0.00719766
thread0::cross_entropy_grad      9           2.55389     0.260096    0.311232    0.283765
thread0::fill_constant           378         2.31142     0.005216    0.01536     0.00611488
thread0::fill_zeros_like         216         1.52355     0.005504    0.012224    0.00705348
thread0::fetch                   18          0.783648    0.025696    0.072096    0.043536
thread0::accuracy                9           0.422208    0.046432    0.048224    0.046912
thread0::feed                    18          0.192256    0.006144    0.02        0.0106809
thread0::mean_grad               9           0.13536     0.013824    0.01536     0.01504
thread0::cross_entropy           9           0.125536    0.012672    0.014688    0.0139484
thread0::scale                   18          0.110144    0.005536    0.007744    0.00611911
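
To compare operators by share of time rather than raw milliseconds, the report can be parsed with a few lines of Python. This is only a rough sketch: the names `parse_report` and `time_shares` are made up for illustration and are not part of PaddlePaddle, and the regex assumes the column layout shown above. Only the first four rows are pasted in, so the printed shares are relative to those rows; paste the whole table to get shares of the full run.

```python
import re

# First four rows copied from the report above. Paste the full table
# to get shares relative to all profiled operators.
REPORT = """\
Event                            Calls       Total       Min.        Max.        Ave.
thread0::lstmp_grad              45          9387.12     203.389     213.289     208.603
thread0::lstmp                   45          4344.11     94.435      98.2327     96.5359
thread0::mul_grad                54          1233.86     8.95155     42.5825     22.8492
thread0::mul                     54          599.213     4.5631      20.7512     11.0965
"""

# Assumed row layout: "threadN::<op>  calls  total  min  max  ave" (times in ms).
ROW = re.compile(
    r"^thread\d+::(\S+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_report(text):
    """Return {op_name: (calls, total_ms)} for every event row in the text."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # the header line and blank lines do not match and are skipped
            rows[match.group(1)] = (int(match.group(2)), float(match.group(3)))
    return rows


def time_shares(rows):
    """Return {op_name: fraction of the summed total time of the parsed rows}."""
    grand_total = sum(total for _, total in rows.values())
    return {name: total / grand_total for name, (_, total) in rows.items()}


if __name__ == "__main__":
    shares = time_shares(parse_report(REPORT))
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:<24}{share:7.1%}")
```

Summing the `Total` column is a reasonable proxy here because all events are reported for the same thread, but it is not the same as wall-clock time.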
