MPI training is still very slow, even though py_reader is already used to read the data
Created by: 333caowei
DSSM model. Training speed: Batch time: 15.881043, sample_per_second: 32.239696, py_reader_queue_size: 188
I can't tell from the profiling output what the bottleneck is.
The profiling report from train.log is as follows:
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
fetch_barrier 90 761229 202.291 17143 8458.1 0.450949
sequence_conv_grad 40320 160199 1.54895 27.9581 3.97318 0.0949011
batch_norm_grad 20160 117971 2.17369 26.3294 5.85176 0.0698859
mul_grad 20160 102925 1.62898 41.2276 5.10542 0.0609725
batch_norm 20160 89984.8 1.76818 17.7321 4.46353 0.0533067
sequence_conv 40320 85197.9 0.772446 25.6383 2.11304 0.0504709
mul 20160 57996.2 0.78887 21.5311 2.8768 0.0343568
sum 30240 47220.6 0.060923 11.9747 1.56153 0.0279733
concat 36090 33670.3 0.016723 44.3501 0.932953 0.0199462
elementwise_add 65520 30987.1 0.011574 22.7972 0.472941 0.0183566
elementwise_add_grad 65520 27419.4 0.019232 10.9512 0.41849 0.0162432
lookup_table 55440 23649.9 0.020133 9.46655 0.426585 0.0140101
sequence_pool 40320 22695.6 0.207364 23.9766 0.562887 0.0134448
concat_grad 30240 15346.1 0.10278 7.52225 0.507476 0.00909096
sequence_pool_grad 40320 15308.8 0.045362 8.04285 0.379684 0.0090689
lookup_table_grad 55440 14739.7 0.01907 11.9122 0.265867 0.00873172
tanh 60480 12936.1 0.072926 1.33797 0.213891 0.0076633
tanh_grad 60480 9817.2 0.053188 3.17885 0.162321 0.00581567
scale 20160 8744.07 0.057911 7.86351 0.433734 0.00517995
broadcast 3690 6662.5 0.132456 7.95325 1.80556 0.00394684
ScopeBufferedSSAGraphExecutorAfterRun 90 6600.67 52.3774 115.897 73.3408 0.00391021
reduce 3690 5894.4 0.145929 28.4731 1.5974 0.00349182
send_barrier 90 5259.27 15.6305 321.357 58.4363 0.00311557
cos_sim_grad 5040 5190.01 0.176672 13.8599 1.02976 0.00307454
ThreadedSSAGraphExecutorPrepare 90 3988.92 34.4373 81.4597 44.3214 0.00236302
read 10080 2794.7 0.031585 2.85077 0.277252 0.00165557
split_selected_rows 270 2273.31 2.69046 25.6101 8.41965 0.0013467
auc 10080 1817.73 0.045017 1.66554 0.180331 0.00107682
cos_sim 5040 1610.53 0.142193 2.27312 0.319549 0.000954071
cast 5040 1311.34 0.011168 5.96485 0.260187 0.000776836
recv 3690 1088.71 0.010092 1.07391 0.295043 0.000644948
elementwise_mul_grad 5040 880.773 0.022939 2.36265 0.174756 0.000521766
elementwise_sub 10080 835.183 0.012162 1.57411 0.0828554 0.000494759
square_grad 5040 502.86 0.013737 1.68386 0.0997738 0.000297892
mean_grad 5040 493.471 0.015765 1.7871 0.097911 0.000292331
elementwise_sub_grad 5040 482.207 0.011793 2.03981 0.0956761 0.000285658
fill_constant 10080 453.586 0.004554 0.865451 0.0449986 0.000268703
mean 5040 424.447 0.006344 2.09921 0.0842158 0.000251441
square 5040 405.676 0.008229 1.8082 0.0804913 0.000240321
fill_constant_batch_size_like 5040 381.764 0.010112 1.40453 0.0757468 0.000226156
elementwise_mul 5040 369.702 0.014741 1.24299 0.0733535 0.00021901
send 3690 238.802 0.008305 4.58592 0.064716 0.000141465
create_double_buffer_reader 5040 35.2148 0.001253 0.123469 0.00698707 2.08611e-05
split_byref 540 25.1774 0.017677 0.141838 0.0466247 1.4915e-05
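To make the merged report above easier to read, a small script can parse its rows and rank events by total time. This is only a minimal sketch: it assumes each data row has exactly seven whitespace-separated fields (Event, Calls, Total, Min., Max., Ave., Ratio.), as in the report above. `ProfileRow` and `parse_profile_rows` are hypothetical helper names, not part of any Paddle API, and `SAMPLE` holds three rows copied from the report.

```python
from typing import List, NamedTuple


class ProfileRow(NamedTuple):
    event: str
    calls: int
    total_ms: float
    min_ms: float
    max_ms: float
    ave_ms: float
    ratio: float


def parse_profile_rows(text: str) -> List[ProfileRow]:
    """Parse profiler rows; skip headers and any line that is not a 7-field data row."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        try:
            rows.append(ProfileRow(parts[0], int(parts[1]), *map(float, parts[2:])))
        except ValueError:
            # The column header line also has 7 fields but is not numeric.
            continue
    return rows


# Three rows copied from the merged report above (assumed layout).
SAMPLE = """\
Event Calls Total Min. Max. Ave. Ratio.
fetch_barrier 90 761229 202.291 17143 8458.1 0.450949
sequence_conv_grad 40320 160199 1.54895 27.9581 3.97318 0.0949011
send_barrier 90 5259.27 15.6305 321.357 58.4363 0.00311557
"""

if __name__ == "__main__":
    rows = parse_profile_rows(SAMPLE)
    # Rank events by total time, largest first.
    for row in sorted(rows, key=lambda r: r.total_ms, reverse=True):
        print(f"{row.event:<24} total={row.total_ms:>10.1f} ms  ratio={row.ratio:.1%}")
```

Run against the full train.log text, the same function also accepts the per-thread rows (for example `thread32::fetch_barrier`), since they use the same seven-column layout.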
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread32::fetch_barrier 4 33112.4 997.959 16221.7 8278.1 0.536985
thread32::sequence_conv_grad 1250 5157.24 1.65616 19.6345 4.12579 0.0836351
thread32::batch_norm_grad 597 3778.16 2.17785 23.8654 6.32857 0.0612705
thread32::mul_grad 635 3190.36 1.65257 30.5185 5.0242 0.0517383
thread32::batch_norm 574 2824.86 1.77967 11.6922 4.92136 0.0458109
thread32::sequence_conv 1172 2609.79 0.81816 12.2794 2.22678 0.0423231
thread32::mul 571 1763.62 0.7956 16.0639 3.08866 0.0286007
thread32::sum 953 1419.51 0.074565 6.76241 1.48952 0.0230203
thread32::concat 1149 1075.02 0.019526 27.7234 0.935611 0.0174336
thread32::elementwise_add 2008 1016.74 0.016663 22.7972 0.506347 0.0164886
thread32::elementwise_add_grad 2013 814.618 0.019352 9.29175 0.404679 0.0132107
thread32::lookup_table 1833 748.177 0.023729 5.47994 0.408171 0.0121332
thread32::sequence_pool 1238 711.355 0.219503 23.6129 0.5746 0.0115361
thread32::concat_grad 928 486.776 0.116789 4.79188 0.524543 0.00789407
thread32::lookup_table_grad 1724 432.322 0.023131 4.43831 0.250767 0.00701098
thread32::sequence_pool_grad 1202 427.692 0.057404 5.59981 0.355817 0.0069359
thread32::tanh 1717 388.454 0.075122 0.692794 0.22624 0.00629957
thread32::tanh_grad 1882 310.839 0.068421 3.16892 0.165164 0.00504089
thread32::scale 657 269.238 0.068524 7.81513 0.409799 0.00436625
thread32::broadcast 109 203.082 0.158403 7.9319 1.86314 0.00329339
thread32::reduce 103 182.949 0.20158 16.3275 1.7762 0.00296689
thread32::cos_sim_grad 159 158.318 0.201341 8.285 0.995708 0.00256744
thread32::read 312 85.0389 0.033594 1.14405 0.27256 0.00137908
thread32::cos_sim 206 67.6419 0.153031 1.09429 0.328359 0.00109695
thread32::split_selected_rows 7 63.759 2.82136 15.5368 9.10842 0.00103398
thread32::send_barrier 2 61.1851 21.1857 39.9994 30.5926 0.000992241
thread32::cast 183 50.5685 0.014297 5.96485 0.27633 0.000820071
thread32::auc 257 45.1107 0.060827 1.24958 0.175528 0.000731562
thread32::recv 111 32.157 0.017725 0.95414 0.289702 0.000521491
thread32::elementwise_sub 348 28.1236 0.014441 1.04099 0.0808149 0.000456082
thread32::elementwise_mul_grad 165 27.3382 0.028309 2.08034 0.165686 0.000443345
thread32::square_grad 153 15.013 0.020982 0.712499 0.098124 0.000243466
thread32::mean_grad 162 14.91 0.018528 0.533497 0.0920369 0.000241796
thread32::elementwise_sub_grad 163 14.4693 0.014491 0.607762 0.088769 0.00023465
thread32::fill_constant_batch_size_like 164 14.4182 0.010354 1.40453 0.087916 0.000233821
thread32::fill_constant 322 14.1305 0.00535 0.573844 0.0438835 0.000229155
thread32::square 171 12.9158 0.011128 0.520051 0.0755311 0.000209456
thread32::elementwise_mul 135 12.4509 0.015132 1.12853 0.092229 0.000201917
thread32::mean 168 10.6736 0.007934 0.682096 0.0635334 0.000173094
thread32::send 134 10.3941 0.011007 0.610125 0.0775683 0.000168562
thread32::create_double_buffer_reader 144 0.985273 0.001939 0.043271 0.00684217 1.59782e-05
thread32::split_byref 16 0.702879 0.024581 0.08162 0.0439299 1.13986e-05
thread31::fetch_barrier 2 29707.2 13954.6 15752.5 14853.6 0.510473
thread31::sequence_conv_grad 1321 5029.36 1.61402 18.1841 3.80724 0.0864221
thread31::batch_norm_grad 661 3737.72 2.19651 17.5022 5.65465 0.0642272
thread31::mul_grad 604 3110.82 1.65088 30.4991 5.15036 0.0534547
thread31::batch_norm 640 2850.89 1.7809 15.0088 4.45451 0.0489882
thread31::sequence_conv 1252 2635.87 0.804854 19.8472 2.10533 0.0452935
thread31::mul 642 1845.88 0.793949 16.7166 2.8752 0.0317186
thread31::sum 1006 1470.65 0.073308 8.40091 1.46188 0.025271
thread31::concat 1171 998.557 0.020134 15.9977 0.852739 0.0171587
thread31::elementwise_add 2017 928.144 0.018199 22.7931 0.460161 0.0159488
thread31::elementwise_add_grad 2062 846.29 0.02439 9.41433 0.410422 0.0145422
thread31::lookup_table 1753 743.682 0.023555 6.47188 0.424234 0.0127791
thread31::sequence_pool 1278 710.93 0.217464 23.5659 0.556283 0.0122163
thread31::sequence_pool_grad 1234 482.803 0.057885 5.11503 0.391251 0.00829625
thread31::concat_grad 975 481.379 0.114148 4.63596 0.493722 0.00827178
thread31::lookup_table_grad 1801 474.612 0.021095 5.36527 0.263527 0.0081555
thread31::tanh 1905 400.802 0.074705 0.641755 0.210395 0.00688719
thread31::tanh_grad 1891 310.317 0.063158 0.580271 0.164102 0.00533234
thread31::scale 598 261.63 0.072468 6.32622 0.437508 0.00449572
thread31::broadcast 115 213.454 0.144017 7.52968 1.85612 0.00366789
thread31::reduce 114 188.056 0.240406 16.0613 1.64962 0.00323147
thread31::cos_sim_grad 160 168.432 0.18633 5.78947 1.0527 0.00289426
thread31::split_selected_rows 12 104.382 2.71594 15.3531 8.69848 0.00179365
thread31::read 336 89.1137 0.038166 1.08559 0.265219 0.00153129
thread31::send_barrier 2 55.4628 23.7659 31.6969 27.7314 0.000953046
thread31::auc 289 54.4152 0.058296 1.0178 0.188288 0.000935045
thread31::cos_sim 157 47.1081 0.154933 0.770324 0.300052 0.000809483
thread31::cast 149 37.7969 0.014777 2.52957 0.25367 0.000649483
thread31::recv 121 35.4511 0.01536 1.0179 0.292984 0.000609174
thread31::elementwise_mul_grad 165 33.5809 0.028159 1.22163 0.20352 0.000577037
thread31::elementwise_sub 315 21.4376 0.015954 0.569542 0.0680557 0.000368372
thread31::mean_grad 135 16.1359 0.021089 1.33524 0.119525 0.000277271
thread31::fill_constant 327 15.5924 0.005519 0.781848 0.0476832 0.000267932
thread31::square_grad 136 15.5098 0.01714 0.784029 0.114043 0.000266513
thread31::square 173 13.7716 0.012742 0.743037 0.0796045 0.000236644
thread31::mean 166 13.7078 0.00865 0.770362 0.0825773 0.000235549
thread31::elementwise_mul 170 12.0182 0.017927 0.577616 0.0706953 0.000206515
thread31::fill_constant_batch_size_like 146 11.6476 0.011157 0.735868 0.079778 0.000200147
thread31::elementwise_sub_grad 137 11.5625 0.015112 0.534516 0.0843981 0.000198685
thread31::send 116 7.34807 0.010338 0.342645 0.0633455 0.000126266
thread31::create_double_buffer_reader 181 1.38442 0.001959 0.05295 0.00764875 2.37893e-05
thread31::split_byref 12 0.464637 0.023763 0.079217 0.0387197 7.98409e-06
thread30::fetch_barrier 6 32927.3 768.03 16241.7 5487.89 0.535654
thread30::sequence_conv_grad 1260 5117.78 1.65352 23.1999 4.06173 0.083255
thread30::batch_norm_grad 612 3779.68 2.18197 18.595 6.17594 0.0614869
thread30::mul_grad 620 3198.24 1.64517 23.467 5.15845 0.0520282
thread30::batch_norm 639 2865.91 1.77916 11.9075 4.485 0.046622
thread30::sequence_conv 1224 2653.58 0.788904 23.9 2.16796 0.0431679
thread30::mul 637 1829.54 0.790907 15.263 2.87212 0.0297626
thread30::sum 928 1435.44 0.08474 8.8055 1.54681 0.0233514
thread30::concat 1168 1015.29 0.019718 29.8214 0.869259 0.0165166
thread30::elementwise_add 2096 952.525 0.012715 9.00628 0.454449 0.0154955
thread30::elementwise_add_grad 1988 802.263 0.020041 7.57228 0.403553 0.013051
thread30::lookup_table 1650 736.848 0.021951 6.31711 0.446575 0.0119869
thread30::sequence_pool 1256 712.279 0.21458 23.5894 0.567101 0.0115872
thread30::concat_grad 910 458.561 0.114077 4.9729 0.503914 0.00745977
thread30::lookup_table_grad 1737 443.367 0.020457 7.81215 0.255249 0.00721259
thread30::sequence_pool_grad 1212 435.615 0.051663 5.57388 0.359419 0.00708649
thread30::tanh 1994 427.126 0.079966 0.645406 0.214206 0.00694839
thread30::tanh_grad 1791 290.28 0.060298 1.0224 0.162077 0.00472221
thread30::scale 648 244.869 0.070999 6.07203 0.377884 0.00398347
thread30::broadcast 105 194.755 0.145982 7.3481 1.85481 0.00316823
thread30::reduce 112 179.287 0.211917 15.6643 1.60078 0.0029166
thread30::cos_sim_grad 158 152.698 0.206023 7.29127 0.966441 0.00248405
thread30::send_barrier 2 111.308 54.8178 56.4898 55.6538 0.00181073
thread30::read 342 86.8546 0.040959 1.93822 0.253961 0.00141293
thread30::split_selected_rows 9 73.1806 3.77164 14.394 8.13118 0.00119049
thread30::auc 316 54.7942 0.055328 1.31401 0.173399 0.000891379
thread30::cos_sim 142 44.862 0.146846 1.00868 0.315929 0.000729804
thread30::cast 151 37.3791 0.01633 4.96162 0.247543 0.000608074
thread30::recv 129 37.0795 0.018008 0.941024 0.287438 0.000603201
thread30::elementwise_mul_grad 147 27.4848 0.026588 1.25138 0.186972 0.000447117
thread30::elementwise_sub 309 23.8011 0.013637 0.851214 0.0770263 0.000387191
thread30::elementwise_sub_grad 168 18.0279 0.01479 0.901324 0.107309 0.000293273
thread30::square_grad 162 16.7725 0.017734 0.897742 0.103534 0.000272851
thread30::mean 158 13.9405 0.008385 1.27701 0.0882308 0.000226781
thread30::mean_grad 147 13.6628 0.017457 0.747887 0.0929445 0.000222264
thread30::square 157 13.4525 0.010839 1.16464 0.0856845 0.000218842
thread30::fill_constant 302 13.0953 0.005773 0.617482 0.043362 0.000213032
thread30::elementwise_mul 141 11.8438 0.018806 1.24299 0.0839989 0.000192673
thread30::fill_constant_batch_size_like 179 11.0472 0.010749 0.495749 0.0617162 0.000179713
thread30::send 114 7.40501 0.011387 0.772236 0.0649562 0.000120463
thread30::split_byref 20 1.00707 0.022252 0.099465 0.0503533 1.63827e-05
thread30::create_double_buffer_reader 164 0.958619 0.00202 0.031131 0.00584524 1.55946e-05
thread29::sequence_conv_grad 1209 5134.83 1.6826 25.5124 4.24717 0.169955
thread29::batch_norm_grad 621 3733.77 2.20231 20.0303 6.01252 0.123582
thread29::mul_grad 620 3130.63 1.67314 21.9439 5.04941 0.103619
thread29::batch_norm 623 2836.01 1.78996 12.3924 4.55218 0.0938676
thread29::sequence_conv 1234 2675.64 0.798467 23.7637 2.16826 0.0885596
thread29::mul 619 1788.76 0.793701 12.5842 2.88976 0.0592052
thread29::fetch_barrier 2 1534.81 692.723 842.083 767.403 0.0507998
thread29::sum 841 1433.89 0.060923 10.6762 1.70498 0.0474596
thread29::concat 1117 1063.61 0.020658 29.6565 0.9522 0.0352038
thread29::elementwise_add 2021 956.749 0.014785 6.68461 0.473404 0.0316669
thread29::elementwise_add_grad 1954 865.14 0.026153 7.45574 0.442753 0.0286348
thread29::lookup_table 1718 730.907 0.024758 6.03704 0.425441 0.0241919
thread29::sequence_pool 1235 702.823 0.216771 23.4534 0.569087 0.0232624
thread29::concat_grad 929 468.455 0.10278 4.35634 0.504258 0.0155052
thread29::sequence_pool_grad 1237 456.783 0.046017 5.8224 0.369267 0.0151188
thread29::lookup_table_grad 1694 454.528 0.020587 6.07156 0.268316 0.0150442
thread29::tanh 1906 409.875 0.07901 0.627835 0.215045 0.0135662
thread29::tanh_grad 1896 313.134 0.061119 3.17885 0.165155 0.0103643
thread29::scale 652 284.882 0.067297 6.39905 0.436936 0.00942916
thread29::broadcast 117 211.614 0.139975 7.5873 1.80866 0.00700409
thread29::send_barrier 3 197.263 57.7332 81.736 65.7544 0.00652911
thread29::cos_sim_grad 174 180.211 0.206435 10.1919 1.03569 0.0059647
thread29::reduce 104 163.171 0.223928 16.0822 1.56895 0.00540071
thread29::read 280 90.7388 0.048997 1.39478 0.324067 0.00300332
thread29::auc 307 54.176 0.058409 0.930818 0.176469 0.00179314
thread29::cos_sim 155 52.9591 0.166349 1.4586 0.341671 0.00175287
thread29::cast 147 38.7476 0.012766 3.36491 0.263589 0.00128249
thread29::recv 117 37.3646 0.017274 0.929309 0.319356 0.00123671
thread29::split_selected_rows 4 33.1586 4.21481 11.1726 8.28966 0.0010975
thread29::elementwise_mul_grad 170 32.8059 0.025101 2.35777 0.192976 0.00108583
thread29::elementwise_sub 303 27.6947 0.01656 1.03651 0.0914016 0.000916652
thread29::elementwise_sub_grad 170 16.7377 0.014155 0.843866 0.0984568 0.000553991
thread29::square 174 15.0917 0.011029 1.36976 0.0867338 0.000499512
thread29::mean 155 14.8939 0.012552 0.921539 0.09609 0.000492967
thread29::mean_grad 154 14.4337 0.018362 0.662406 0.0937252 0.000477733
thread29::square_grad 122 12.7592 0.017876 1.68386 0.104584 0.00042231
thread29::fill_constant 304 12.6673 0.004984 0.38295 0.0416689 0.00041927
thread29::fill_constant_batch_size_like 188 12.5862 0.014033 0.69352 0.066948 0.000416585
thread29::elementwise_mul 162 10.0433 0.018114 0.40815 0.0619957 0.000332418
thread29::send 103 6.63412 0.012553 0.661339 0.064409 0.000219579
thread29::split_byref 21 0.956443 0.020687 0.086206 0.0455449 3.16568e-05
thread29::create_double_buffer_reader 136 0.930767 0.002173 0.040413 0.00684388 3.0807e-05
thread28::fetch_barrier 3 15566 731.701 14054.9 5188.67 0.352785
thread28::sequence_conv_grad 1217 5015.5 1.60556 15.9852 4.1212 0.11367
thread28::batch_norm_grad 590 3660.69 2.19591 19.9525 6.20456 0.0829651
thread28::mul_grad 626 3237.75 1.6841 27.1763 5.17213 0.0733798
thread28::batch_norm 600 2856.46 1.80548 12.9678 4.76077 0.0647383
thread28::sequence_conv 1214 2672.02 0.826396 23.1496 2.20101 0.0605581
thread28::mul 616 1786.42 0.799326 15.0881 2.90003 0.0404871
thread28::sum 929 1451.88 0.07905 7.64262 1.56285 0.0329052
thread28::concat 1159 1021.03 0.019928 20.1853 0.880954 0.0231403
thread28::elementwise_add 1940 931.851 0.015172 10.8858 0.480336 0.0211193
thread28::elementwise_add_grad 1964 858.692 0.020825 8.10077 0.437216 0.0194612
thread28::lookup_table 1659 744.649 0.020372 6.40308 0.448854 0.0168766
thread28::sequence_pool 1230 702.727 0.216314 23.8177 0.571322 0.0159264
thread28::concat_grad 975 512.901 0.112892 3.50805 0.526052 0.0116243
thread28::sequence_pool_grad 1284 488.831 0.045362 4.95196 0.380709 0.0110788
thread28::lookup_table_grad 1721 456.076 0.025844 7.38565 0.265006 0.0103364
thread28::tanh 1832 397.226 0.079166 0.663381 0.216827 0.00900266
thread28::tanh_grad 1806 294.878 0.069085 1.2496 0.163277 0.00668306
thread28::scale 585 264.143 0.071104 5.4815 0.451526 0.00598647
thread28::broadcast 124 220.785 0.141802 6.03605 1.78052 0.00500382
thread28::send_barrier 3 210.875 48.6153 98.0241 70.2917 0.00477923
thread28::cos_sim_grad 155 170.659 0.186766 6.65201 1.10102 0.00386777
thread28::reduce 101 114.383 0.233775 16.9968 1.1325 0.00259235
thread28::read 340 84.0425 0.034425 1.61124 0.247184 0.00190472
thread28::auc 340 65.2949 0.05537 1.41675 0.192044 0.00147983
thread28::split_selected_rows 7 54.5682 4.1844 12.0312 7.79546 0.00123672
thread28::cos_sim 147 45.8704 0.145236 1.10472 0.312044 0.0010396
thread28::cast 155 41.7897 0.0144 2.37473 0.269611 0.000947112
thread28::recv 122 31.3298 0.011324 0.960444 0.256802 0.000710053
thread28::elementwise_mul_grad 171 24.7812 0.027175 0.847965 0.144919 0.000561637
thread28::elementwise_sub 265 22.6835 0.016036 1.26494 0.085598 0.000514093
thread28::fill_constant 337 15.7978 0.005704 0.606248 0.0468779 0.000358039
thread28::mean_grad 148 15.1091 0.018017 0.775741 0.102089 0.00034243
thread28::square 167 13.814 0.012455 0.944013 0.0827185 0.000313077
thread28::elementwise_sub_grad 137 13.6352 0.013938 0.897899 0.0995268 0.000309025
thread28::square_grad 157 13.2844 0.019777 0.737475 0.0846137 0.000301074
thread28::mean 153 12.5067 0.009462 0.940145 0.081743 0.000283449
thread28::fill_constant_batch_size_like 181 12.4944 0.013174 0.539264 0.0690298 0.00028317
thread28::elementwise_mul 167 10.7113 0.017727 0.492118 0.0641397 0.000242759
thread28::send 113 7.10326 0.013561 0.433759 0.0628607 0.000160987
thread28::create_double_buffer_reader 172 1.2063 0.002109 0.049891 0.0070134 2.73394e-05
thread28::split_byref 17 0.774264 0.029472 0.092702 0.0455449 1.75478e-05
thread27::fetch_barrier 4 29406.5 1209.77 14143.4 7351.64 0.506261
thread27::sequence_conv_grad 1237 5006.36 1.64577 20.7867 4.04718 0.0861892
thread27::batch_norm_grad 585 3409.13 2.19779 17.9873 5.82757 0.0586913
thread27::mul_grad 630 3278.65 1.64017 24.3022 5.20421 0.0564451
thread27::batch_norm 622 2839.97 1.86759 12.6277 4.56587 0.0488927
thread27::sequence_conv 1279 2684.07 0.889042 24.6597 2.09857 0.0462088
thread27::mul 577 1748.47 0.796849 15.6979 3.03027 0.0301015
thread27::sum 933 1480.77 0.078054 8.60146 1.58711 0.0254928
thread27::concat 1128 1063.8 0.020099 27.484 0.943087 0.0183143
thread27::elementwise_add 2026 960.644 0.014543 6.94819 0.474158 0.0165384
thread27::elementwise_add_grad 2089 889.251 0.022686 9.43554 0.425683 0.0153093
thread27::lookup_table 1765 735.363 0.021839 6.3001 0.416636 0.01266
thread27::sequence_pool 1265 707.816 0.216244 23.5843 0.559538 0.0121857
thread27::concat_grad 1040 527.55 0.115001 4.24172 0.50726 0.00908226
thread27::sequence_pool_grad 1252 507.849 0.052333 7.04819 0.40563 0.00874309
thread27::lookup_table_grad 1768 463.918 0.025629 10.0056 0.262397 0.00798679
thread27::tanh 1901 409.946 0.082556 0.659873 0.215648 0.00705761
thread27::tanh_grad 1935 312.516 0.060412 0.568534 0.161507 0.00538025
thread27::scale 632 285.53 0.064897 7.33192 0.451788 0.00491566
thread27::send_barrier 3 233.117 48.4784 121.599 77.7055 0.00401332
thread27::reduce 133 220.836 0.191205 18.0618 1.66042 0.0038019
thread27::broadcast 119 205.311 0.158503 6.19072 1.7253 0.00353462
thread27::cos_sim_grad 176 176.81 0.19704 5.54927 1.0046 0.00304396
thread27::split_selected_rows 10 96.8279 4.65727 17.6932 9.68279 0.00166698
thread27::read 336 82.9406 0.044074 1.47636 0.246847 0.0014279
thread27::auc 328 62.9201 0.053253 1.19445 0.19183 0.00108323
thread27::cos_sim 159 49.7489 0.144835 0.888927 0.312886 0.000856474
thread27::cast 166 41.6675 0.011168 2.08828 0.251009 0.000717344
thread27::recv 108 30.8038 0.017092 0.898941 0.285221 0.000530317
thread27::elementwise_mul_grad 157 27.7483 0.027431 1.05382 0.176741 0.000477713
thread27::elementwise_sub 304 27.1147 0.013748 1.30397 0.0891932 0.000466805
thread27::square_grad 147 13.9542 0.016845 1.04025 0.0949267 0.000240235
thread27::mean 155 13.8892 0.00874 0.846924 0.0896079 0.000239116
thread27::fill_constant_batch_size_like 169 13.8769 0.012518 1.19909 0.0821115 0.000238903
thread27::mean_grad 142 13.8445 0.021263 1.05775 0.0974968 0.000238347
thread27::elementwise_mul 174 13.5003 0.01664 0.801072 0.077588 0.000232421
thread27::fill_constant 313 13.1475 0.005031 0.666766 0.0420046 0.000226346
thread27::square 138 11.8336 0.011678 1.29488 0.0857509 0.000203727
thread27::elementwise_sub_grad 157 10.8754 0.014995 0.511468 0.0692704 0.000187231
thread27::send 109 5.05945 0.011137 0.288446 0.046417 8.71032e-05
thread27::create_double_buffer_reader 150 0.973883 0.002039 0.034502 0.00649255 1.67663e-05
thread27::split_byref 20 0.784766 0.022937 0.07276 0.0392383 1.35105e-05
thread26::fetch_barrier 3 14768.1 202.291 13821.9 4922.7 0.340468
thread26::sequence_conv_grad 1362 4957.06 1.54895 20.3399 3.63955 0.114281
thread26::batch_norm_grad 665 3425.33 2.18786 20.4966 5.15088 0.0789686
thread26::mul_grad 701 3407.95 1.64849 27.0761 4.86155 0.0785678
thread26::batch_norm 695 2748.69 1.79443 11.9709 3.95495 0.0633691
thread26::sequence_conv 1309 2642.26 0.821308 25.3928 2.01853 0.0609154
thread26::mul 673 1834.41 0.79512 15.8526 2.72571 0.0422909
thread26::sum 1025 1486.65 0.070291 7.94914 1.45039 0.0342736
thread26::concat 1174 1097.52 0.01792 28.6442 0.934856 0.0253025
thread26::elementwise_add 2176 1003.65 0.014859 10.1676 0.461237 0.0231385
thread26::elementwise_add_grad 2254 931.946 0.025983 9.30406 0.413463 0.0214853
thread26::lookup_table 1722 743.238 0.023703 6.13125 0.431613 0.0171348
thread26::sequence_pool 1297 707.328 0.212425 23.4386 0.545357 0.0163069
thread26::sequence_pool_grad 1376 491.403 0.056898 5.86648 0.357124 0.0113289
thread26::concat_grad 1029 482.712 0.109688 4.06371 0.469108 0.0111286
thread26::lookup_table_grad 1802 479.426 0.019269 6.62234 0.266052 0.0110528
thread26::tanh 1928 393.242 0.076417 0.664953 0.203964 0.00906591
thread26::tanh_grad 2050 328.153 0.064962 1.3954 0.160075 0.00756534
thread26::scale 612 269.072 0.067595 4.6713 0.43966 0.00620325
thread26::broadcast 114 213.036 0.258859 5.72949 1.86874 0.0049114
thread26::cos_sim_grad 152 161.094 0.19905 6.60044 1.05983 0.00371391
thread26::send_barrier 3 148.796 32.8186 79.1782 49.5988 0.00343039
thread26::reduce 109 132.541 0.209732 12.9687 1.21597 0.00305564
thread26::read 338 90.842 0.036786 2.42634 0.268763 0.0020943
thread26::split_selected_rows 9 86.2314 4.07072 14.9966 9.58127 0.001988
thread26::auc 310 56.8369 0.061882 1.27636 0.183345 0.00131033
thread26::cos_sim 154 45.3135 0.157009 1.02111 0.294244 0.00104467
thread26::cast 153 36.0897 0.013114 2.66384 0.235881 0.000832023
thread26::recv 113 33.5429 0.01243 1.06444 0.29684 0.000773307
thread26::elementwise_mul_grad 172 29.9557 0.029852 1.39627 0.174161 0.000690608
thread26::elementwise_sub 328 26.8002 0.016472 0.735818 0.0817081 0.00061786
thread26::square_grad 170 16.933 0.017313 0.775393 0.0996059 0.000390378
thread26::mean_grad 170 15.5973 0.018425 1.54992 0.091749 0.000359585
thread26::fill_constant 334 14.475 0.006275 0.502458 0.0433382 0.00033371
thread26::mean 160 13.8796 0.006344 1.08231 0.0867477 0.000319985
thread26::square 149 12.0788 0.012138 1.00908 0.0810654 0.000278467
thread26::fill_constant_batch_size_like 164 12.0125 0.01159 0.885307 0.0732472 0.00027694
thread26::elementwise_sub_grad 153 11.609 0.012805 0.717539 0.0758755 0.000267636
thread26::elementwise_mul 153 10.3314 0.014741 0.463774 0.0675253 0.000238182
thread26::send 118 7.77325 0.009705 0.333087 0.065875 0.000179207
thread26::create_double_buffer_reader 159 1.21812 0.001667 0.067506 0.00766111 2.80828e-05
thread26::split_byref 18 0.787693 0.026162 0.119938 0.0437607 1.81597e-05
thread25::fetch_barrier 5 30933.8 437.574 14413 6186.75 0.515156
thread25::sequence_conv_grad 1219 4853.94 1.60333 19.3605 3.9819 0.0808351
thread25::batch_norm_grad 675 3689.35 2.17369 19.1139 5.4657 0.0614406
thread25::mul_grad 651 3301.39 1.64546 34.7968 5.07127 0.0549798
thread25::batch_norm 669 2826.82 1.78971 14.6573 4.22545 0.0470765
thread25::sequence_conv 1324 2687.1 0.838679 23.8879 2.02953 0.0447496
thread25::mul 635 1752.46 0.793336 18.6076 2.75978 0.0291846
thread25::sum 934 1413.03 0.076512 6.8735 1.51288 0.023532
thread25::concat 1101 1078.8 0.020892 28.0689 0.979832 0.0179657
thread25::elementwise_add 2064 960.163 0.014645 6.06984 0.465195 0.0159901
thread25::elementwise_add_grad 2219 916.183 0.022633 7.32887 0.412881 0.0152577
thread25::lookup_table 1737 728.186 0.023995 5.53981 0.419221 0.0121269
thread25::sequence_pool 1287 689.526 0.207364 3.43817 0.535763 0.011483
thread25::send_barrier 6 684.496 25.769 321.357 114.083 0.0113993
thread25::lookup_table_grad 1833 484.109 0.021082 7.39796 0.264108 0.00806212
thread25::sequence_pool_grad 1280 480.344 0.057818 5.83773 0.375269 0.00799941
thread25::concat_grad 945 452.635 0.108362 3.0855 0.478978 0.00753795
thread25::tanh 1923 404.569 0.074186 0.595578 0.210385 0.0067375
thread25::tanh_grad 1931 312.59 0.063312 0.932865 0.16188 0.00520573
thread25::scale 693 300.676 0.07009 5.41297 0.433876 0.00500732
thread25::broadcast 121 217.561 0.186888 4.23608 1.79803 0.00362316
thread25::reduce 143 199.914 0.184808 17.0769 1.398 0.00332927
thread25::cos_sim_grad 158 158.217 0.201364 7.96774 1.00137 0.00263486
thread25::read 340 84.908 0.035631 2.84462 0.249729 0.00141402
thread25::split_selected_rows 7 67.0861 2.74846 19.1169 9.58374 0.00111722
thread25::auc 340 60.2585 0.045785 1.32191 0.177231 0.00100351
thread25::cos_sim 154 50.2302 0.155414 1.36856 0.32617 0.000836508
thread25::cast 176 44.0181 0.013997 4.08449 0.250103 0.000733056
thread25::recv 118 32.212 0.014443 0.989908 0.272983 0.000536442
thread25::elementwise_mul_grad 173 28.0014 0.027063 1.3378 0.161858 0.000466322
thread25::elementwise_sub 299 23.7887 0.016141 0.633676 0.0795607 0.000396165
thread25::square_grad 163 16.3627 0.020534 0.973175 0.100385 0.000272496
thread25::mean 158 15.9343 0.010323 1.56872 0.10085 0.000265362
thread25::elementwise_sub_grad 155 15.7587 0.017256 0.958733 0.101669 0.000262437
thread25::mean_grad 184 15.5612 0.018803 0.948857 0.0845719 0.000259149
thread25::square 153 15.5063 0.011852 0.933417 0.101349 0.000258235
thread25::fill_constant 314 14.7903 0.005365 0.442019 0.047103 0.000246311
thread25::fill_constant_batch_size_like 159 13.124 0.0106 0.82471 0.0825408 0.000218561
thread25::send 136 12.5483 0.008305 4.58592 0.0922669 0.000208973
thread25::elementwise_mul 131 9.44994 0.020559 0.610193 0.0721369 0.000157375
thread25::create_double_buffer_reader 181 1.17207 0.002304 0.042728 0.00647554 1.95191e-05
thread25::split_byref 17 0.875646 0.021916 0.113188 0.0515086 1.45826e-05
thread24::fetch_barrier 3 31510.2 680.83 17143 10503.4 0.523696
thread24::sequence_conv_grad 1246 4966.83 1.63136 16.3844 3.98622 0.0825481
thread24::batch_norm_grad 629 3910.18 2.19468 19.4211 6.2165 0.0649867
thread24::mul_grad 633 3127.02 1.68169 34.822 4.94001 0.0519707
thread24::batch_norm 615 2831.03 1.80115 10.5274 4.60331 0.0470514
thread24::sequence_conv 1255 2626 0.856488 25.6383 2.09243 0.0436437
thread24::mul 610 1825.41 0.794643 13.4341 2.99247 0.0303381
thread24::sum 1001 1517.77 0.080132 7.43092 1.51625 0.0252251
thread24::concat 1108 1043.85 0.019347 36.0599 0.942104 0.0173487
thread24::elementwise_add 2113 958.119 0.017199 5.10241 0.45344 0.0159238
thread24::elementwise_add_grad 1983 793.357 0.022405 7.8507 0.400079 0.0131855
thread24::lookup_table 1746 751.488 0.020903 6.00349 0.430405 0.0124896
thread24::sequence_pool 1252 715.856 0.211429 23.5023 0.57177 0.0118974
thread24::concat_grad 921 485.583 0.114623 3.30242 0.527235 0.00807033
thread24::lookup_table_grad 1667 445.377 0.023916 6.35675 0.267173 0.00740211
thread24::sequence_pool_grad 1267 434.856 0.051095 4.9469 0.343217 0.00722724
thread24::tanh 1861 398.152 0.083029 0.743202 0.213945 0.00661724
thread24::tanh_grad 1869 307.026 0.053188 0.592716 0.164273 0.00510273
thread24::scale 611 263.829 0.067502 5.44807 0.431799 0.0043848
thread24::reduce 124 243.534 0.217919 19.5786 1.96398 0.0040475
thread24::broadcast 119 214.641 0.14022 4.72021 1.80371 0.00356731
thread24::send_barrier 4 160.447 23.7894 64.2686 40.1118 0.00266661
thread24::cos_sim_grad 152 143.63 0.21049 3.95445 0.944931 0.0023871
thread24::read 346 79.4223 0.040844 0.995213 0.229544 0.00131999
thread24::auc 328 61.083 0.066111 0.735284 0.186229 0.00101519
thread24::cos_sim 168 55.938 0.145622 1.25635 0.332964 0.000929682
thread24::cast 136 42.5291 0.014048 2.47532 0.312714 0.000706828
thread24::split_selected_rows 4 41.8476 7.75345 12.2965 10.4619 0.000695502
thread24::recv 121 35.3829 0.014853 0.978213 0.29242 0.000588059
thread24::elementwise_mul_grad 167 28.7346 0.027943 1.36036 0.172064 0.000477566
thread24::elementwise_sub 332 26.3158 0.013406 0.79711 0.0792644 0.000437365
thread24::square_grad 182 17.808 0.020503 0.901623 0.0978461 0.000295967
thread24::mean_grad 161 16.0417 0.019502 0.691252 0.0996378 0.000266611
thread24::fill_constant 303 15.3339 0.00569 0.865451 0.0506069 0.000254847
thread24::elementwise_sub_grad 169 14.4496 0.015178 0.846031 0.0855006 0.00024015
thread24::elementwise_mul 173 13.6795 0.016235 0.588444 0.0790721 0.000227351
thread24::mean 145 13.2935 0.009052 0.863991 0.0916796 0.000220937
thread24::square 155 11.774 0.012286 1.13366 0.0759611 0.000195682
thread24::fill_constant_batch_size_like 171 11.3734 0.010112 0.852956 0.0665113 0.000189025
thread24::send 129 8.22006 0.010999 0.623758 0.0637214 0.000136616
thread24::create_double_buffer_reader 129 0.841683 0.002289 0.04158 0.00652467 1.39887e-05
thread24::split_byref 13 0.64641 0.030271 0.074567 0.0497238 1.07432e-05
thread23::fetch_barrier 1 15748.4 15748.4 15748.4 15748.4 0.356557
thread23::sequence_conv_grad 1251 5067.39 1.64766 22.9841 4.05067 0.11473
thread23::batch_norm_grad 645 3827.23 2.17905 19.4489 5.93369 0.0866517
thread23::mul_grad 628 3278.65 1.6527 28.3492 5.22077 0.0742313
thread23::batch_norm 614 2827.99 1.78196 17.7321 4.60585 0.0640281
thread23::sequence_conv 1230 2609.79 0.856069 23.7345 2.12178 0.059088
thread23::mul 639 1839.6 0.795393 17.9971 2.87887 0.0416501
thread23::sum 907 1505.37 0.071164 11.3352 1.65973 0.0340829
thread23::concat 1076 976.14 0.017187 14.036 0.907193 0.0221006
thread23::elementwise_add 1996 960.15 0.016011 7.33187 0.481037 0.0217386
thread23::elementwise_add_grad 1959 828.429 0.022723 6.98961 0.422883 0.0187563
thread23::lookup_table 1755 779.488 0.021101 6.19306 0.444153 0.0176483
thread23::sequence_pool 1238 704.49 0.218145 23.4939 0.569055 0.0159503
thread23::concat_grad 838 439.396 0.117301 6.18133 0.524338 0.00994829
thread23::sequence_pool_grad 1149 420.364 0.05757 5.03714 0.365852 0.0095174
thread23::lookup_table_grad 1442 414.798 0.021351 7.8668 0.287655 0.00939138
thread23::tanh 1841 398.78 0.0789 0.992387 0.21661 0.00902871
thread23::tanh_grad 1771 278.998 0.060921 0.450839 0.157537 0.00631675
thread23::scale 640 268.167 0.0704 5.97407 0.419011 0.00607152
thread23::broadcast 115 209.938 0.164276 4.04969 1.82555 0.00475317
thread23::cos_sim_grad 143 144.262 0.206992 5.82457 1.00882 0.0032662
thread23::reduce 105 138.416 0.220743 21.0082 1.31825 0.00313386
thread23::read 356 86.7979 0.039844 1.18606 0.243814 0.00196518
thread23::auc 325 54.3156 0.05991 1.13214 0.167125 0.00122975
thread23::cos_sim 154 51.9684 0.146442 1.84068 0.337457 0.00117661
thread23::split_selected_rows 8 47.6175 2.93959 12.3649 5.95219 0.0010781
thread23::cast 157 40.3018 0.014366 2.39133 0.256699 0.000912466
thread23::recv 123 34.2138 0.01745 0.960046 0.278161 0.000774629
thread23::elementwise_mul_grad 158 25.9892 0.032066 1.05476 0.164489 0.000588417
thread23::elementwise_sub 294 24.0235 0.015316 0.93897 0.0817127 0.000543913
thread23::square_grad 158 21.4463 0.017573 1.49274 0.135736 0.000485562
thread23::fill_constant 321 15.8355 0.006114 0.511964 0.0493319 0.00035853
thread23::send_barrier 1 15.6305 15.6305 15.6305 15.6305 0.000353889
thread23::mean_grad 157 15.0769 0.015782 0.743207 0.0960311 0.000341353
thread23::elementwise_sub_grad 141 13.7607 0.016335 0.962749 0.097594 0.000311555
thread23::square 159 13.1298 0.012472 1.11909 0.0825775 0.00029727
thread23::mean 168 12.4649 0.00742 0.848475 0.074196 0.000282217
thread23::fill_constant_batch_size_like 157 11.4516 0.014066 0.547144 0.0729399 0.000259273
thread23::elementwise_mul 146 9.83182 0.016029 0.473233 0.0673412 0.000222601
thread23::send 104 5.96769 0.015402 0.343579 0.0573816 0.000135113
thread23::create_double_buffer_reader 148 1.1261 0.001996 0.064732 0.00760875 2.54957e-05
thread23::split_byref 17 0.774611 0.023288 0.095191 0.0455654 1.75378e-05
thread22::fetch_barrier 2 14786.6 743.354 14043.2 7393.28 0.341354
thread22::sequence_conv_grad 1229 4973.41 1.61787 27.6037 4.04671 0.114813
thread22::batch_norm_grad 632 3738.82 2.19715 18.192 5.91585 0.0863121
thread22::mul_grad 622 3092.59 1.65701 29.8767 4.97202 0.0713937
thread22::batch_norm 617 2865.72 1.76883 13.0133 4.6446 0.0661562
thread22::sequence_conv 1243 2689.35 0.903958 23.5036 2.1636 0.0620847
thread22::mul 587 1802.24 0.796739 19.3855 3.07026 0.0416054
thread22::sum 961 1533.37 0.071646 8.6024 1.5956 0.0353984
thread22::concat 1153 1055.52 0.020121 27.4968 0.91546 0.0243672
thread22::elementwise_add 1983 914.51 0.015121 4.96554 0.461175 0.0211118
thread22::elementwise_add_grad 2013 866.808 0.025512 6.06976 0.430605 0.0200106
thread22::lookup_table 1607 719.103 0.021146 5.92111 0.447482 0.0166008
thread22::sequence_pool 1263 710.016 0.213486 23.5061 0.562167 0.016391
thread22::sequence_pool_grad 1223 509.589 0.054652 7.62606 0.416671 0.0117641
thread22::concat_grad 980 503.94 0.111721 5.14636 0.514224 0.0116336
thread22::lookup_table_grad 1744 458.777 0.024259 6.88037 0.26306 0.0105911
thread22::tanh 1956 420.797 0.084793 0.670876 0.215131 0.00971425
thread22::tanh_grad 1901 309.951 0.060685 2.49512 0.163046 0.00715533
thread22::scale 630 258.517 0.063831 7.11118 0.410345 0.00596798
thread22::broadcast 116 202.658 0.161211 7.95325 1.74705 0.00467843
thread22::reduce 101 168.375 0.213942 24.9698 1.66708 0.003887
thread22::cos_sim_grad 149 166.679 0.195302 9.74924 1.11865 0.00384784
thread22::read 346 91.5511 0.037408 1.66717 0.264599 0.00211349
thread22::split_selected_rows 9 80.9572 3.46815 23.3136 8.99525 0.00186893
thread22::auc 307 55.4321 0.05891 1.19911 0.180561 0.00127967
thread22::cos_sim 161 51.3252 0.151488 0.92215 0.31879 0.00118486
thread22::cast 148 41.123 0.012856 2.13414 0.277858 0.000949341
thread22::send_barrier 1 41.0655 41.0655 41.0655 41.0655 0.000948013
thread22::recv 113 34.3274 0.016812 0.964399 0.303783 0.000792462
thread22::elementwise_mul_grad 179 26.7179 0.029269 0.920049 0.149262 0.000616792
thread22::elementwise_sub 295 24.7606 0.014389 0.790467 0.0839343 0.000571609
thread22::mean_grad 167 16.9989 0.017961 1.05374 0.10179 0.000392427
thread22::mean 170 15.9548 0.007407 1.45864 0.0938516 0.000368322
thread22::elementwise_sub_grad 154 15.848 0.017855 1.11728 0.102909 0.000365856
thread22::square 160 15.244 0.011063 0.993673 0.095275 0.000351914
thread22::fill_constant 326 13.3108 0.005138 0.411298 0.0408307 0.000307285
thread22::square_grad 138 12.8418 0.019366 0.502885 0.0930566 0.000296458
thread22::fill_constant_batch_size_like 155 12.5914 0.011875 0.757514 0.0812346 0.000290677
thread22::elementwise_mul 144 10.336 0.01674 0.472379 0.071778 0.000238611
thread22::send 121 8.31588 0.013572 0.486433 0.0687263 0.000191975
thread22::create_double_buffer_reader 148 0.945978 0.002003 0.032536 0.00639174 2.18383e-05
thread22::split_byref 11 0.494659 0.027059 0.077313 0.044969 1.14194e-05
thread21::fetch_barrier 3 43097.6 13512 15105.3 14365.9 0.600079
thread21::sequence_conv_grad 1301 5043.71 1.61192 26.0295 3.8768 0.0702271
thread21::batch_norm_grad 638 3593.27 2.19122 19.0793 5.63208 0.0500316
thread21::mul_grad 645 3324.28 1.62898 23.5786 5.15392 0.0462863
thread21::batch_norm 645 2762.26 1.77527 11.2422 4.28258 0.0384609
thread21::sequence_conv 1272 2679.1 0.886793 24.0392 2.10621 0.037303
thread21::mul 634 1755.83 0.792558 14.2177 2.76945 0.0244477
thread21::sum 897 1365.44 0.075036 9.90257 1.52223 0.0190119
thread21::concat 1161 1121.15 0.02052 33.9789 0.965675 0.0156105
thread21::elementwise_add 2072 1037.15 0.015222 6.70529 0.500553 0.0144409
thread21::elementwise_add_grad 2042 874.969 0.020921 9.61363 0.428486 0.0121828
thread21::lookup_table 1710 737.942 0.020939 5.27826 0.431545 0.0102749
thread21::sequence_pool 1233 698.425 0.220104 23.4903 0.566444 0.00972466
thread21::lookup_table_grad 1626 489.865 0.023096 7.62985 0.30127 0.00682074
thread21::sequence_pool_grad 1158 479.28 0.050558 4.58579 0.413886 0.00667335
thread21::concat_grad 908 473.634 0.108708 4.10501 0.521624 0.00659474
thread21::tanh 1969 405.569 0.082105 0.620371 0.205977 0.00564702
thread21::scale 667 309.897 0.057911 7.42818 0.464613 0.00431491
thread21::tanh_grad 1794 282.393 0.064327 0.755469 0.15741 0.00393195
thread21::send_barrier 5 228.128 17.1887 78.7203 45.6256 0.00317639
thread21::broadcast 114 199.341 0.239162 4.67631 1.74861 0.00277557
thread21::reduce 117 181.01 0.205287 16.8949 1.54709 0.00252033
thread21::cos_sim_grad 146 141.273 0.182995 6.5603 0.967627 0.00196705
thread21::split_selected_rows 14 101.755 2.71536 12.9298 7.26823 0.00141681
thread21::read 308 82.4812 0.045187 1.18173 0.267796 0.00114844
thread21::auc 319 55.9328 0.050707 0.856863 0.175338 0.000778792
thread21::cos_sim 138 45.0327 0.153096 2.27312 0.326324 0.000627022
thread21::cast 151 39.6235 0.014561 2.24225 0.262407 0.000551705
thread21::recv 115 33.5653 0.019093 0.909094 0.291872 0.000467353
thread21::elementwise_mul_grad 175 32.4755 0.029163 1.19356 0.185574 0.000452179
thread21::elementwise_sub 316 25.6867 0.012984 0.934494 0.0812869 0.000357653
thread21::square_grad 167 18.8069 0.018373 0.888387 0.112616 0.000261861
thread21::mean_grad 154 16.1777 0.017966 0.786458 0.10505 0.000225253
thread21::elementwise_sub_grad 149 14.1615 0.016446 0.821828 0.0950437 0.000197181
thread21::square 160 13.7496 0.011698 1.00944 0.0859352 0.000191446
thread21::fill_constant 301 13.5626 0.004613 0.837845 0.0450585 0.000188842
thread21::fill_constant_batch_size_like 164 12.9193 0.012481 0.617164 0.0787764 0.000179885
thread21::mean 159 12.0693 0.007209 1.21087 0.0759074 0.000168049
thread21::elementwise_mul 164 11.5726 0.015035 0.578837 0.0705647 0.000161134
thread21::send 113 6.95984 0.010235 0.314117 0.0615915 9.69067e-05
thread21::create_double_buffer_reader 161 1.14071 0.001832 0.035779 0.00708519 1.5883e-05
thread21::split_byref 14 0.750289 0.025825 0.091782 0.0535921 1.04468e-05
thread20::fetch_barrier 2 32484 15390.8 17093.3 16242 0.531606
thread20::sequence_conv_grad 1195 4946.23 1.65047 16.3378 4.1391 0.0809458
thread20::batch_norm_grad 592 3913.02 2.19879 22.7582 6.60982 0.0640371
thread20::mul_grad 567 3027.3 1.67098 22.5027 5.33916 0.0495423
thread20::batch_norm 591 2945.95 1.80455 13.4534 4.98468 0.0482109
thread20::sequence_conv 1187 2711.58 0.823018 24.5853 2.2844 0.0443754
thread20::mul 590 1835.91 0.793663 14.3754 3.11171 0.0300449
thread20::sum 1018 1678.85 0.08516 11.4555 1.64917 0.0274747
thread20::concat 1122 959.818 0.017109 14.0732 0.855452 0.0157076
thread20::elementwise_add 1837 917.837 0.015855 7.09625 0.499639 0.0150205
thread20::elementwise_add_grad 2000 825.078 0.024269 10.5409 0.412539 0.0135025
thread20::lookup_table 1777 718.606 0.021023 5.97393 0.404393 0.0117601
thread20::sequence_pool 1236 715.487 0.221941 23.672 0.578873 0.0117091
thread20::sequence_pool_grad 1209 468.712 0.061316 6.35119 0.387685 0.00767054
thread20::concat_grad 856 466.591 0.112875 3.67769 0.545083 0.00763584
thread20::lookup_table_grad 1576 446.736 0.026273 6.63615 0.283462 0.00731091
thread20::tanh 1682 379.625 0.075994 0.76396 0.225699 0.00621263
thread20::tanh_grad 1799 296.848 0.062923 0.868369 0.165007 0.00485796
thread20::broadcast 118 212.588 0.148947 7.46339 1.80159 0.00347903
thread20::scale 491 200.677 0.069581 6.6348 0.408712 0.00328412
thread20::cos_sim_grad 149 148.464 0.222818 13.8599 0.996402 0.00242963
thread20::send_barrier 4 148.398 25.6223 52.7368 37.0994 0.00242855
thread20::reduce 124 145.19 0.164745 16.593 1.17088 0.00237605
thread20::read 276 91.3446 0.037494 1.23626 0.330959 0.00149487
thread20::split_selected_rows 7 65.8375 2.82463 16.2701 9.40536 0.00107744
thread20::auc 313 56.4912 0.051144 1.07221 0.180483 0.000924487
thread20::cast 159 49.1462 0.015337 3.85186 0.309096 0.000804285
thread20::cos_sim 144 47.2949 0.154546 0.936602 0.328437 0.000773988
thread20::recv 114 37.0336 0.010092 0.9768 0.324857 0.000606061
thread20::elementwise_mul_grad 158 27.6752 0.030488 2.36265 0.175159 0.000452909
thread20::elementwise_sub 317 27.1023 0.014944 1.09668 0.0854962 0.000443533
thread20::square_grad 172 15.5784 0.017796 0.945176 0.0905722 0.000254943
thread20::mean 150 14.6372 0.008604 1.17129 0.0975816 0.000239541
thread20::fill_constant 302 13.7302 0.005231 0.567153 0.0454641 0.000224696
thread20::elementwise_sub_grad 163 13.2582 0.014845 0.654758 0.0813386 0.000216972
thread20::elementwise_mul 156 13.0262 0.014858 0.785309 0.0835014 0.000213176
thread20::fill_constant_batch_size_like 149 11.6114 0.011643 1.14918 0.0779291 0.000190023
thread20::mean_grad 135 11.5551 0.017356 0.722703 0.0855936 0.000189102
thread20::square 128 8.34403 0.010894 0.486216 0.0651878 0.000136551
thread20::send 108 6.80019 0.011454 0.350981 0.0629647 0.000111286
thread20::create_double_buffer_reader 136 0.828601 0.002411 0.029964 0.00609265 1.35602e-05
thread20::split_byref 14 0.63609 0.028396 0.070724 0.045435 1.04097e-05
thread19::fetch_barrier 4 29029.1 694.832 14407.1 7257.28 0.502734
thread19::sequence_conv_grad 1212 5016.36 1.63789 21.0147 4.13891 0.0868746
thread19::batch_norm_grad 595 3631.33 2.20523 26.3294 6.10308 0.0628884
thread19::mul_grad 631 3265.16 1.66204 22.8498 5.17458 0.0565469
thread19::batch_norm 597 2827.81 1.777 15.6547 4.73669 0.0489727
thread19::sequence_conv 1212 2666.31 0.850494 23.9727 2.19993 0.0461759
thread19::mul 609 1810.62 0.800135 14.9569 2.97311 0.0313568
thread19::sum 953 1477.14 0.066221 10.074 1.54999 0.0255815
thread19::concat 1054 993.09 0.02061 19.5554 0.94221 0.0171986
thread19::elementwise_add 2055 960.38 0.016906 7.3865 0.467338 0.0166321
thread19::elementwise_add_grad 2027 880.131 0.02392 8.99847 0.434204 0.0152423
thread19::lookup_table 1771 723.131 0.020133 4.70301 0.408318 0.0125234
thread19::sequence_pool 1226 710.665 0.214332 23.9766 0.579662 0.0123075
thread19::sequence_pool_grad 1262 473.038 0.064279 8.04285 0.374832 0.0081922
thread19::concat_grad 920 463.55 0.111541 4.14919 0.503859 0.00802788
thread19::lookup_table_grad 1669 416.251 0.025078 5.71805 0.249401 0.00720875
thread19::tanh 1838 409.06 0.082525 0.662879 0.222557 0.00708421
thread19::tanh_grad 1885 314.329 0.058754 2.51797 0.166753 0.00544363
thread19::scale 652 286.677 0.07041 6.85159 0.439689 0.00496475
thread19::reduce 125 266.04 0.19716 19.4594 2.12832 0.00460736
thread19::send_barrier 4 227.197 45.9536 69.7229 56.7991 0.00393465
thread19::broadcast 112 223.76 0.135387 7.8345 1.99786 0.00387514
thread19::cos_sim_grad 147 174.189 0.206784 9.95114 1.18496 0.00301665
thread19::read 274 88.7967 0.036668 1.21082 0.324076 0.00153781
thread19::auc 309 53.7931 0.05167 0.848008 0.174088 0.000931603
thread19::split_selected_rows 9 52.4788 2.69046 11.1881 5.83098 0.000908842
thread19::cos_sim 157 50.3278 0.15827 1.54877 0.32056 0.000871591
thread19::cast 159 45.5306 0.014469 5.92833 0.286356 0.000788512
thread19::recv 117 32.552 0.016257 0.95485 0.278223 0.000563745
thread19::elementwise_mul_grad 165 32.4768 0.029171 2.08277 0.196829 0.000562441
thread19::elementwise_sub 301 23.701 0.012919 1.00067 0.0787408 0.00041046
thread19::square_grad 149 16.2087 0.019526 0.949605 0.108783 0.000280706
thread19::mean_grad 150 15.1723 0.01785 0.872939 0.101149 0.000262758
thread19::fill_constant 317 15.0497 0.005363 0.551595 0.0474754 0.000260635
thread19::elementwise_sub_grad 147 14.1757 0.014315 1.0625 0.0964332 0.000245498
thread19::elementwise_mul 154 14.101 0.017836 1.15315 0.0915652 0.000244206
thread19::square 154 11.1589 0.012304 0.754724 0.0724602 0.000193252
thread19::mean 140 10.383 0.008666 0.597313 0.0741645 0.000179816
thread19::fill_constant_batch_size_like 128 9.6085 0.011273 0.94739 0.0750664 0.000166403
thread19::send 117 9.60774 0.014536 2.54194 0.0821174 0.000166389
thread19::create_double_buffer_reader 181 1.08848 0.001847 0.03621 0.00601373 1.88507e-05
thread19::split_byref 19 0.943578 0.029156 0.089259 0.049662 1.63411e-05
thread18::sequence_conv_grad 1283 4889.44 1.59985 22.6447 3.81094 0.167788
thread18::batch_norm_grad 656 3486.26 2.18306 19.9326 5.31442 0.119636
thread18::mul_grad 701 3364.22 1.64713 24.1475 4.79917 0.115448
thread18::batch_norm 664 2753.1 1.77211 11.4267 4.14624 0.0944768
thread18::sequence_conv 1335 2680.15 0.849999 24.0253 2.0076 0.0919733
thread18::mul 647 1786.17 0.797252 19.5658 2.76069 0.0612949
thread18::sum 997 1521.71 0.072455 9.45782 1.52629 0.0522198
thread18::concat 1171 1090.52 0.022406 27.5154 0.931272 0.0374228
thread18::elementwise_add 2136 973.014 0.016123 6.74549 0.455531 0.0333904
thread18::elementwise_add_grad 2087 906.736 0.022122 7.79538 0.434469 0.031116
thread18::lookup_table 1750 734.438 0.022908 5.98871 0.419679 0.0252033
thread18::sequence_pool 1273 694.163 0.217895 23.6397 0.545297 0.0238212
thread18::sequence_pool_grad 1293 505.837 0.058987 6.22779 0.391212 0.0173585
thread18::concat_grad 968 487.321 0.11239 5.53998 0.503431 0.0167231
thread18::lookup_table_grad 1798 477.427 0.021234 5.94184 0.265532 0.0163836
thread18::fetch_barrier 1 435.134 435.134 435.134 435.134 0.0149323
thread18::tanh 1984 409.984 0.077096 0.668421 0.206645 0.0140692
thread18::tanh_grad 1888 302.635 0.066295 1.40691 0.160294 0.0103854
thread18::scale 695 292.926 0.072012 3.98964 0.421476 0.0100522
thread18::send_barrier 4 253.891 43.5421 98.0797 63.4728 0.00871265
thread18::broadcast 121 213.104 0.16573 5.15323 1.76119 0.00731296
thread18::reduce 108 210.507 0.206602 23.0571 1.94914 0.00722385
thread18::cos_sim_grad 163 169.96 0.218728 7.2493 1.0427 0.00583244
thread18::read 340 84.8192 0.040729 2.33291 0.249468 0.00291069
thread18::split_selected_rows 10 76.0063 4.43824 10.6106 7.60063 0.00260827
thread18::auc 308 56.8706 0.057807 1.36348 0.184645 0.0019516
thread18::cos_sim 161 50.5872 0.150065 1.50996 0.314206 0.00173597
thread18::cast 161 35.768 0.015962 1.82048 0.222161 0.00122743
thread18::recv 111 32.7464 0.016446 1.02713 0.295012 0.00112374
thread18::elementwise_mul_grad 157 27.1668 0.028093 1.34278 0.173037 0.000932269
thread18::elementwise_sub 310 26.3509 0.014819 0.887087 0.0850029 0.000904269
thread18::elementwise_sub_grad 175 15.0321 0.015779 1.14287 0.0858978 0.000515849
thread18::mean_grad 164 14.399 0.01902 0.811885 0.0877989 0.000494123
thread18::mean 178 13.6458 0.006926 0.772702 0.0766616 0.000468274
thread18::square 178 13.3084 0.010695 0.77262 0.0747663 0.000456697
thread18::square_grad 150 12.8707 0.019083 1.04247 0.0858048 0.000441678
thread18::fill_constant 291 12.3099 0.005418 0.807463 0.0423019 0.000422431
thread18::fill_constant_batch_size_like 138 11.1599 0.011623 0.490275 0.0808685 0.000382967
thread18::elementwise_mul 142 10.5333 0.018157 0.820505 0.0741784 0.000361466
thread18::send 120 6.31916 0.012475 0.304842 0.0526597 0.000216851
thread18::create_double_buffer_reader 160 1.07992 0.001885 0.048253 0.00674953 3.70592e-05
thread18::split_byref 23 0.920456 0.024903 0.089584 0.0400198 3.15868e-05
thread17::fetch_barrier 1 13792.4 13792.4 13792.4 13792.4 0.323741
thread17::sequence_conv_grad 1259 5132.17 1.62182 27.9581 4.07639 0.120464
thread17::batch_norm_grad 619 3809.58 2.2067 18.8413 6.15441 0.08942
thread17::mul_grad 589 3073.5 1.68779 25.6828 5.21817 0.0721426
thread17::batch_norm 614 2794.58 1.77637 12.1982 4.55143 0.0655955
thread17::sequence_conv 1244 2689.63 0.895704 23.1494 2.16208 0.0631321
thread17::mul 637 1883.09 0.797644 17.8217 2.95618 0.0442006
thread17::sum 908 1470 0.077417 8.84333 1.61894 0.0345045
thread17::concat 1093 1030.72 0.018862 29.2485 0.943023 0.0241936
thread17::elementwise_add 1996 962.778 0.015652 8.02674 0.482354 0.0225987
thread17::elementwise_add_grad 1996 796.214 0.020176 7.1779 0.398905 0.0186891
thread17::lookup_table 1655 732.101 0.022726 6.29599 0.442357 0.0171842
thread17::sequence_pool 1226 721.587 0.221589 23.7239 0.58857 0.0169374
thread17::concat_grad 897 491.312 0.11471 4.03648 0.547728 0.0115323
thread17::sequence_pool_grad 1246 462.505 0.055023 6.25459 0.371192 0.0108561
thread17::lookup_table_grad 1595 454.61 0.022855 10.5289 0.285022 0.0106708
thread17::tanh 1872 406.328 0.075647 0.855471 0.217055 0.00953749
thread17::tanh_grad 1926 320.522 0.064224 1.13814 0.166419 0.00752343
thread17::send_barrier 5 284.243 23.279 85.5019 56.8486 0.00667187
thread17::scale 715 278.659 0.070483 5.46589 0.389733 0.0065408
thread17::broadcast 102 178.489 0.181334 4.96293 1.7499 0.00418958
thread17::cos_sim_grad 156 145.74 0.212049 7.024 0.934234 0.00342088
thread17::reduce 96 133.023 0.22829 11.2894 1.38565 0.00312237
thread17::split_selected_rows 12 113.68 5.1527 25.6101 9.47333 0.00266834
thread17::read 290 80.5073 0.036007 1.25773 0.277612 0.0018897
thread17::auc 321 60.9137 0.060799 1.56961 0.189762 0.00142979
thread17::cos_sim 167 53.228 0.152368 1.17021 0.318731 0.00124939
thread17::cast 156 41.0111 0.013253 2.69204 0.262891 0.000962629
thread17::recv 105 38.3102 0.021659 0.994597 0.364859 0.000899234
thread17::elementwise_sub 326 29.498 0.014647 1.22714 0.0904847 0.00069239
thread17::elementwise_mul_grad 147 25.5183 0.024613 1.30679 0.173594 0.000598977
thread17::square_grad 151 15.7288 0.01585 0.709834 0.104164 0.000369194
thread17::elementwise_sub_grad 157 15.2698 0.016281 1.02174 0.0972598 0.000358419
thread17::fill_constant 328 14.5891 0.005024 0.589343 0.0444788 0.000342441
thread17::square 155 14.3665 0.012528 1.2739 0.0926869 0.000337216
thread17::elementwise_mul 178 13.9116 0.016058 0.543435 0.0781548 0.000326538
thread17::mean 166 13.4263 0.010344 0.685308 0.0808811 0.000315147
thread17::mean_grad 142 12.4334 0.019817 0.816257 0.0875591 0.000291842
thread17::fill_constant_batch_size_like 137 9.32519 0.014808 0.640509 0.0680671 0.000218885
thread17::send 98 5.45294 0.011199 0.503958 0.0556422 0.000127994
thread17::create_double_buffer_reader 161 1.16345 0.002039 0.083364 0.0072264 2.7309e-05
thread17::split_byref 19 1.09677 0.029116 0.141838 0.0577247 2.57438e-05
thread16::sequence_conv_grad 1234 5068.08 1.5901 21.3976 4.10703 0.172471
thread16::batch_norm_grad 584 3663.71 2.17496 18.2359 6.27347 0.124679
thread16::mul_grad 624 3264.71 1.64975 28.2693 5.23191 0.111101
thread16::batch_norm 624 2835.74 1.78641 11.8513 4.54445 0.0965024
thread16::sequence_conv 1205 2664.63 0.829259 23.6789 2.21131 0.0906793
thread16::mul 581 1745.44 0.78887 15.6111 3.0042 0.0593986
thread16::sum 917 1510.83 0.075098 9.22761 1.64757 0.0514146
thread16::concat 1111 1039.28 0.020066 26.2739 0.935445 0.0353675
thread16::elementwise_add 2005 1006.38 0.016635 8.10163 0.501937 0.034248
thread16::elementwise_add_grad 1948 842.286 0.022934 10.0185 0.432385 0.0286636
thread16::fetch_barrier 1 815.554 815.554 815.554 815.554 0.027754
thread16::lookup_table 1790 730.096 0.020958 6.01011 0.407875 0.0248457
thread16::sequence_pool 1266 722.255 0.210311 23.5631 0.570501 0.0245789
thread16::concat_grad 940 471.088 0.117968 3.53152 0.501157 0.0160315
thread16::sequence_pool_grad 1247 446.075 0.063025 5.57028 0.357719 0.0151803
thread16::lookup_table_grad 1786 439.48 0.026277 5.27401 0.24607 0.0149559
thread16::tanh 1873 412.731 0.084586 0.640052 0.220358 0.0140456
thread16::tanh_grad 1835 303.294 0.061538 0.632217 0.165283 0.0103213
thread16::scale 581 243.337 0.066817 4.22531 0.418824 0.00828094
thread16::broadcast 127 219.656 0.149015 5.68518 1.72958 0.00747507
thread16::reduce 113 179.939 0.204034 17.7762 1.59238 0.00612348
thread16::cos_sim_grad 149 145.174 0.192264 9.95585 0.974321 0.00494038
thread16::send_barrier 3 116.097 18.4953 70.2662 38.6989 0.00395086
thread16::read 274 91.7951 0.042797 1.39986 0.335019 0.00312386
thread16::auc 295 50.5177 0.056365 0.667877 0.171246 0.00171916
thread16::cos_sim 153 48.8883 0.156588 1.74156 0.319531 0.00166371
thread16::split_selected_rows 5 46.7155 5.59107 20.7688 9.3431 0.00158977
thread16::cast 159 43.7482 0.016281 2.4619 0.275146 0.00148879
thread16::recv 114 36.1554 0.017134 0.996306 0.317152 0.0012304
thread16::elementwise_mul_grad 155 29.9651 0.029798 1.62637 0.193323 0.00101974
thread16::elementwise_sub 337 25.7691 0.014028 1.17152 0.0764663 0.000876944
thread16::elementwise_sub_grad 177 18.449 0.014027 1.02732 0.104232 0.000627834
thread16::fill_constant_batch_size_like 161 15.0814 0.010985 0.865759 0.0936733 0.000513232
thread16::fill_constant 329 14.8989 0.004834 0.618648 0.0452855 0.000507022
thread16::mean_grad 165 14.8855 0.017274 0.721135 0.0902151 0.000506565
thread16::square_grad 147 14.34 0.019057 0.688626 0.097551 0.000488001
thread16::elementwise_mul 171 14.2461 0.019459 0.942034 0.0833104 0.000484805
thread16::square 184 13.3345 0.011366 0.68369 0.0724703 0.000453785
thread16::mean 159 12.1153 0.007243 0.60343 0.0761968 0.000412293
thread16::send 98 7.02674 0.013411 0.38939 0.0717014 0.000239125
thread16::create_double_buffer_reader 125 0.918931 0.002346 0.039607 0.00735145 3.12719e-05
thread16::split_byref 10 0.464999 0.031692 0.074876 0.0464999 1.58243e-05
thread15::fetch_barrier 4 28538 699.537 13611.6 7134.49 0.499558
thread15::sequence_conv_grad 1226 4908.07 1.67294 19.4151 4.00332 0.085916
thread15::batch_norm_grad 653 3880.79 2.19572 25.2117 5.94302 0.0679334
thread15::mul_grad 621 2980.78 1.65075 41.2276 4.79997 0.0521787
thread15::batch_norm 641 2724.24 1.7821 11.7414 4.24999 0.047688
thread15::sequence_conv 1308 2657.41 0.794802 23.4899 2.03166 0.0465181
thread15::mul 680 1884.38 0.790369 17.0555 2.77115 0.0329862
thread15::sum 974 1558.13 0.069512 8.59767 1.59972 0.0272751
thread15::concat 1123 1103.68 0.019196 26.926 0.9828 0.0193201
thread15::elementwise_add 2236 978.01 0.015688 5.49389 0.437393 0.0171201
thread15::elementwise_add_grad 2161 904.176 0.025663 6.99672 0.418406 0.0158276
thread15::lookup_table 1730 734.793 0.023187 5.58529 0.424736 0.0128626
thread15::sequence_pool 1278 710.625 0.212032 23.5593 0.556045 0.0124395
thread15::sequence_pool_grad 1285 517.193 0.055214 7.18704 0.402485 0.00905349
thread15::concat_grad 930 478.552 0.118851 5.88573 0.514572 0.00837708
thread15::lookup_table_grad 1713 457.079 0.02204 4.8127 0.26683 0.00800119
thread15::tanh 1981 420.033 0.082785 0.612286 0.212031 0.0073527
thread15::tanh_grad 1995 331.735 0.060455 0.930644 0.166283 0.00580703
thread15::scale 611 238.216 0.071588 4.58754 0.389879 0.00416999
thread15::broadcast 114 202.53 0.262423 5.69124 1.77658 0.0035453
thread15::reduce 119 199.016 0.196622 21.1522 1.6724 0.00348379
thread15::cos_sim_grad 142 140.976 0.203025 4.23323 0.992785 0.00246778
thread15::read 292 94.9663 0.040232 2.19612 0.325227 0.00166239
thread15::send_barrier 1 75.7509 75.7509 75.7509 75.7509 0.00132602
thread15::split_selected_rows 9 58.4094 2.75547 10.1974 6.48993 0.00102246
thread15::cos_sim 160 52.1398 0.148024 2.13756 0.325874 0.000912709
thread15::auc 278 50.0978 0.049326 0.850952 0.180208 0.000876963
thread15::cast 160 37.339 0.013909 3.56748 0.233369 0.000653621
thread15::recv 111 32.4296 0.010139 0.967537 0.292158 0.000567681
thread15::elementwise_sub 338 28.9574 0.013848 0.846196 0.0856729 0.000506901
thread15::elementwise_mul_grad 152 25.6778 0.029086 1.20453 0.168933 0.000449491
thread15::mean_grad 154 17.7711 0.019544 0.772523 0.115397 0.000311084
thread15::elementwise_sub_grad 166 16.3466 0.013365 0.768063 0.0984732 0.000286147
thread15::fill_constant 330 15.1485 0.005467 0.436224 0.0459045 0.000265175
thread15::square_grad 140 15.107 0.01817 0.936365 0.107907 0.00026445
thread15::mean 168 14.5903 0.009699 0.671572 0.086847 0.000255404
thread15::square 167 14.574 0.01129 1.16623 0.0872696 0.000255119
thread15::elementwise_mul 146 10.4047 0.01697 0.487965 0.0712648 0.000182134
thread15::fill_constant_batch_size_like 154 9.39132 0.012448 0.583613 0.0609826 0.000164395
thread15::send 118 7.35312 0.010528 0.356032 0.0623145 0.000128717
thread15::create_double_buffer_reader 162 1.05682 0.001841 0.042849 0.00652358 1.84997e-05
thread15::split_byref 9 0.489016 0.031891 0.096926 0.0543351 8.56025e-06
thread14::fetch_barrier 3 28805.8 598.42 14609.6 9601.95 0.502786
thread14::sequence_conv_grad 1307 5028.04 1.62059 23.1688 3.84701 0.087761
thread14::batch_norm_grad 655 3639.91 2.19145 20.5552 5.55712 0.0635322
thread14::mul_grad 618 3161.16 1.65234 23.809 5.11514 0.0551759
thread14::batch_norm 642 2819.97 1.78863 13.5083 4.39248 0.0492207
thread14::sequence_conv 1294 2656.66 0.805601 23.5749 2.05306 0.0463701
thread14::mul 632 1884.29 0.791656 16.149 2.98147 0.032889
thread14::sum 915 1418.96 0.073638 7.69184 1.55077 0.0247669
thread14::concat 1109 970.6 0.019083 14.4647 0.875202 0.0169412
thread14::elementwise_add 2060 945.667 0.015598 8.56758 0.459062 0.016506
thread14::elementwise_add_grad 2156 902.773 0.023717 6.58863 0.418726 0.0157573
thread14::lookup_table 1742 739.098 0.022104 5.1861 0.424281 0.0129005
thread14::sequence_pool 1261 713.477 0.214482 23.6525 0.565802 0.0124533
thread14::sequence_pool_grad 1371 524.463 0.058202 5.60846 0.38254 0.00915414
thread14::lookup_table_grad 1874 476.327 0.023088 6.89149 0.254177 0.00831398
thread14::concat_grad 982 469.453 0.112823 4.48988 0.478058 0.00819398
thread14::tanh 1853 395.85 0.077268 0.65027 0.213627 0.0069093
thread14::tanh_grad 1974 318.616 0.066807 3.15411 0.161406 0.00556123
thread14::scale 676 260.939 0.066844 6.41976 0.386004 0.00455451
thread14::broadcast 112 213.799 0.150107 5.30079 1.90892 0.00373172
thread14::reduce 122 169.356 0.166509 16.1039 1.38816 0.002956
thread14::cos_sim_grad 178 160.856 0.18307 3.8452 0.903687 0.00280764
thread14::send_barrier 2 122.647 27.5014 95.1461 61.3237 0.00214073
thread14::read 338 88.7393 0.035838 1.29777 0.262542 0.00154888
thread14::auc 345 61.961 0.056183 1.0282 0.179597 0.00108149
thread14::cos_sim 152 50.8639 0.148356 1.19378 0.334631 0.000887796
thread14::split_selected_rows 6 46.867 3.58268 14.5785 7.81117 0.000818032
thread14::cast 151 41.0037 0.016004 2.67795 0.271548 0.000715691
thread14::recv 111 31.1804 0.0151 0.989183 0.280905 0.000544234
thread14::elementwise_mul_grad 170 26.51 0.030063 0.722098 0.155941 0.000462715
thread14::elementwise_sub 311 25.8702 0.014587 0.887262 0.0831838 0.000451546
thread14::elementwise_sub_grad 188 19.3118 0.011793 1.267 0.102722 0.000337074
thread14::mean_grad 147 16.4179 0.019592 0.999505 0.111687 0.000286564
thread14::fill_constant 326 15.3212 0.005819 0.653809 0.0469975 0.000267421
thread14::square_grad 149 12.4598 0.020187 0.922629 0.0836225 0.000217477
thread14::square 163 12.3843 0.010943 0.952218 0.0759771 0.000216159
thread14::fill_constant_batch_size_like 169 11.8987 0.011689 0.567329 0.0704067 0.000207684
thread14::mean 157 11.5393 0.00861 0.820177 0.0734984 0.00020141
thread14::elementwise_mul 156 10.2313 0.016927 0.544329 0.0655852 0.00017858
thread14::send 120 8.4821 0.012462 0.66202 0.0706842 0.000148049
thread14::split_byref 29 1.30912 0.024397 0.094548 0.0451419 2.28497e-05
thread14::create_double_buffer_reader 154 1.29751 0.001968 0.123469 0.00842538 2.26471e-05
thread13::fetch_barrier 5 47342.1 678.779 15462.1 9468.41 0.624303
thread13::sequence_conv_grad 1279 4949.08 1.61172 21.4853 3.86949 0.0652638
thread13::batch_norm_grad 669 3611.42 2.19495 19.2016 5.39824 0.047624
thread13::mul_grad 618 3184.3 1.637 23.4244 5.15259 0.0419916
thread13::batch_norm 664 2733.57 1.82008 11.8729 4.11683 0.0360478
thread13::sequence_conv 1320 2656.05 0.838995 16.1645 2.01216 0.0350255
thread13::mul 699 1828.35 0.791065 14.4808 2.61566 0.0241106
thread13::sum 977 1590.04 0.086002 8.49257 1.62747 0.0209679
thread13::elementwise_add 2204 1045.33 0.016958 22.7691 0.474288 0.0137848
thread13::concat 1072 1024.9 0.020397 27.5078 0.956067 0.0135155
thread13::elementwise_add_grad 2028 874.373 0.023593 5.53368 0.43115 0.0115304
thread13::lookup_table 1699 720.678 0.020974 4.72891 0.424178 0.00950363
thread13::sequence_pool 1298 710.753 0.212638 23.8921 0.547575 0.00937274
thread13::concat_grad 1009 510.448 0.11512 3.96439 0.505895 0.00673132
thread13::sequence_pool_grad 1229 508.286 0.055244 7.02819 0.413577 0.0067028
thread13::lookup_table_grad 1748 466.334 0.021748 8.72934 0.266782 0.00614958
thread13::tanh 2000 407.252 0.074562 1.33797 0.203626 0.00537045
thread13::tanh_grad 1932 311.327 0.063593 2.24798 0.161142 0.00410549
thread13::scale 662 281.752 0.072002 5.36245 0.425607 0.00371548
thread13::reduce 109 219.894 0.220399 25.7349 2.01737 0.00289975
thread13::broadcast 117 208.254 0.148068 4.66118 1.77995 0.00274625
thread13::cos_sim_grad 156 159.775 0.1948 8.89747 1.0242 0.00210696
thread13::read 336 92.1633 0.040354 2.85077 0.274296 0.00121536
thread13::auc 328 67.5141 0.052528 1.66554 0.205836 0.000890313
thread13::split_selected_rows 6 44.8147 2.82099 16.3935 7.46912 0.000590974
thread13::cos_sim 153 44.3642 0.159415 1.017 0.289962 0.000585033
thread13::cast 159 36.7063 0.014936 3.11617 0.230857 0.000484048
thread13::recv 119 33.1921 0.012474 0.959421 0.278925 0.000437706
thread13::elementwise_sub 323 27.8602 0.014594 1.19528 0.0862545 0.000367394
thread13::elementwise_mul_grad 164 26.3672 0.030007 1.53316 0.160775 0.000347705
thread13::mean_grad 175 16.8838 0.017026 0.989554 0.096479 0.000222648
thread13::square_grad 162 15.5832 0.019925 0.870899 0.0961927 0.000205497
thread13::fill_constant 313 14.4904 0.005611 0.397676 0.0462952 0.000191086
thread13::square 155 14.199 0.011461 0.84051 0.0916065 0.000187243
thread13::elementwise_sub_grad 150 13.9458 0.013962 0.697143 0.0929723 0.000183905
thread13::mean 167 12.3102 0.00904 0.747296 0.0737135 0.000162335
thread13::fill_constant_batch_size_like 159 10.7238 0.013344 0.719343 0.0674453 0.000141416
thread13::elementwise_mul 138 10.1446 0.017084 1.0272 0.0735115 0.000133777
thread13::send 87 4.28645 0.010395 0.258075 0.0492695 5.65256e-05
thread13::create_double_buffer_reader 174 1.16594 0.001253 0.048149 0.00670082 1.53754e-05
thread13::split_byref 19 0.938318 0.030328 0.092657 0.0493852 1.23737e-05
thread12::fetch_barrier 2 16885.4 635.451 16249.9 8442.68 0.37073
thread12::sequence_conv_grad 1289 5043.84 1.62472 17.5516 3.91299 0.110741
thread12::batch_norm_grad 631 3625.86 2.19348 18.6965 5.74621 0.0796084
thread12::mul_grad 621 3200.81 1.67376 26.0158 5.15429 0.0702762
thread12::batch_norm 647 2692.55 1.77723 13.5355 4.16158 0.0591168
thread12::sequence_conv 1331 2635.84 0.853006 24.9275 1.98035 0.0578719
thread12::mul 659 1826.61 0.792017 14.4572 2.77179 0.0401046
thread12::sum 876 1471.05 0.072411 11.9747 1.67928 0.0322979
thread12::concat 1149 1087.79 0.020338 27.0509 0.946724 0.0238831
thread12::elementwise_add 2155 1018.06 0.014548 6.62965 0.472416 0.0223522
thread12::elementwise_add_grad 2035 868.054 0.024279 6.99659 0.426562 0.0190588
thread12::lookup_table 1743 760.167 0.022344 6.1278 0.436125 0.01669
thread12::sequence_pool 1261 707.588 0.218216 23.7041 0.561132 0.0155356
thread12::sequence_pool_grad 1287 486.82 0.061934 6.36729 0.378259 0.0106885
thread12::concat_grad 920 483.243 0.104707 4.34066 0.525264 0.0106099
thread12::lookup_table_grad 1681 451.957 0.02202 8.59856 0.268862 0.00992304
thread12::tanh 1974 413.332 0.072926 0.641173 0.209388 0.00907501
thread12::scale 643 323.11 0.070393 7.86351 0.502504 0.00709412
thread12::tanh_grad 1873 298.325 0.061725 1.63261 0.159277 0.00654994
thread12::broadcast 114 209.346 0.141323 5.51092 1.83637 0.00459635
thread12::send_barrier 3 202.201 36.2935 103.97 67.4004 0.00443947
thread12::reduce 109 197.503 0.179334 19.7022 1.81196 0.00433632
thread12::cos_sim_grad 161 168.809 0.195283 7.32996 1.0485 0.00370632
thread12::read 292 87.9797 0.036032 1.71051 0.3013 0.00193166
thread12::auc 309 53.3957 0.046257 1.29749 0.172802 0.00117234
thread12::split_selected_rows 6 48.0049 2.75862 11.8476 8.00081 0.00105398
thread12::cast 163 47.7612 0.014649 3.60712 0.293014 0.00104863
thread12::cos_sim 154 46.6308 0.156555 1.12942 0.302798 0.00102381
thread12::recv 112 35.0294 0.013687 1.02095 0.312762 0.000769096
thread12::elementwise_mul_grad 159 26.4996 0.030099 1.02092 0.166664 0.000581818
thread12::elementwise_sub 317 25.4204 0.012162 0.973159 0.0801906 0.000558124
thread12::elementwise_sub_grad 172 15.3915 0.013948 1.04276 0.0894856 0.000337932
thread12::square_grad 168 15.196 0.020816 0.89669 0.0904524 0.000333639
thread12::elementwise_mul 200 13.8375 0.01605 0.75418 0.0691876 0.000303813
thread12::mean 148 13.4655 0.009911 0.852742 0.0909833 0.000295646
thread12::square 153 13.1179 0.01156 0.657972 0.085738 0.000288013
thread12::fill_constant 303 12.8033 0.004943 0.412392 0.042255 0.000281105
thread12::mean_grad 146 12.1727 0.01971 0.490179 0.083375 0.000267262
thread12::fill_constant_batch_size_like 140 10.6978 0.011406 0.456522 0.0764132 0.000234879
thread12::send 124 8.40225 0.011551 0.461244 0.0677601 0.000184477
thread12::create_double_buffer_reader 163 1.15944 0.002091 0.078007 0.00711312 2.54563e-05
thread12::split_byref 17 1.02015 0.025098 0.107562 0.0600089 2.23982e-05
thread11::fetch_barrier 4 45568.5 809.808 16084.1 11392.1 0.615
thread11::sequence_conv_grad 1249 4859.05 1.62629 23.4927 3.89035 0.0655786
thread11::mul_grad 676 3544.63 1.65947 29.132 5.24354 0.047839
thread11::batch_norm_grad 601 3429.46 2.19645 21.8815 5.70626 0.0462846
thread11::batch_norm 605 2733.86 1.79092 14.2844 4.51877 0.0368966
thread11::sequence_conv 1250 2634.88 0.820235 25.1457 2.1079 0.0355608
thread11::mul 652 1895.02 0.790145 19.4952 2.90647 0.0255755
thread11::sum 973 1487.5 0.067386 7.35649 1.52877 0.0200755
thread11::concat 1130 1112.87 0.019077 44.3501 0.984837 0.0150194
thread11::elementwise_add 2012 965.653 0.017221 8.25071 0.479947 0.0130326
thread11::elementwise_add_grad 2096 873.467 0.023211 8.03383 0.41673 0.0117885
thread11::lookup_table 1711 752.819 0.022836 5.98346 0.439988 0.0101602
thread11::sequence_pool 1260 711.894 0.218566 23.7061 0.564995 0.00960784
thread11::lookup_table_grad 1761 484.998 0.021475 5.76734 0.275411 0.00654562
thread11::sequence_pool_grad 1251 480.758 0.055072 5.17188 0.384299 0.0064884
thread11::concat_grad 947 469.298 0.104643 3.91665 0.495562 0.00633372
thread11::tanh 1819 388.44 0.077048 0.654295 0.213546 0.00524246
thread11::scale 653 303.048 0.064814 7.57688 0.464086 0.00408999
thread11::tanh_grad 1844 291.265 0.062156 0.667186 0.157953 0.00393096
thread11::broadcast 122 211.607 0.139496 5.65774 1.73448 0.00285588
thread11::reduce 118 184.836 0.145929 16.8689 1.56641 0.00249458
thread11::cos_sim_grad 177 179.699 0.192542 6.60584 1.01525 0.00242525
thread11::split_selected_rows 12 99.3336 2.99079 16.9102 8.2778 0.00134062
thread11::read 338 81.4341 0.039677 1.4443 0.240929 0.00109905
thread11::auc 325 56.8271 0.045439 1.29111 0.174852 0.000766948
thread11::cos_sim 186 56.6724 0.152497 1.18111 0.30469 0.00076486
thread11::cast 156 37.3727 0.011853 2.73271 0.239568 0.000504388
thread11::recv 115 31.7564 0.01835 0.979048 0.276143 0.00042859
thread11::elementwise_sub 327 27.364 0.015852 0.983325 0.0836819 0.000369309
thread11::elementwise_mul_grad 130 19.6901 0.028662 0.759218 0.151462 0.000265741
thread11::mean_grad 183 17.8197 0.019616 1.06529 0.0973755 0.000240498
thread11::square_grad 159 15.6391 0.018209 1.05876 0.0983593 0.000211068
thread11::fill_constant_batch_size_like 168 14.6429 0.012591 0.982678 0.0871603 0.000197624
thread11::mean 161 14.2395 0.00812 0.813723 0.0884443 0.000192179
thread11::fill_constant 307 14.0419 0.006318 0.442073 0.0457391 0.000189512
thread11::elementwise_sub_grad 145 12.7879 0.016061 1.30061 0.0881922 0.000172587
thread11::square 152 10.964 0.011616 1.29577 0.0721316 0.000147972
thread11::elementwise_mul 129 10.8097 0.017406 0.873579 0.083796 0.000145889
thread11::send 132 8.37255 0.010064 0.886904 0.0634284 0.000112997
thread11::create_double_buffer_reader 156 1.11368 0.002302 0.076595 0.00713897 1.50304e-05
thread11::split_byref 16 0.668356 0.018892 0.075453 0.0417723 9.02025e-06
thread10::fetch_barrier 3 27905.3 609.927 13970.1 9301.77 0.495614
thread10::sequence_conv_grad 1242 5001.15 1.65689 21.3562 4.02669 0.0888233
thread10::batch_norm_grad 628 3805.69 2.19921 19.6471 6.06002 0.0675912
thread10::mul_grad 578 3159.19 1.66225 25.5194 5.46573 0.056109
thread10::batch_norm 606 2858.39 1.81015 14.8521 4.71681 0.0507666
thread10::sequence_conv 1239 2675.24 0.879054 25.6072 2.15919 0.0475137
thread10::mul 592 1756.27 0.793066 15.5724 2.96667 0.0311923
thread10::sum 951 1405.63 0.075799 8.31579 1.47805 0.0249648
thread10::concat 1147 1049.77 0.018676 26.1 0.915233 0.0186445
thread10::elementwise_add 1940 956.783 0.015292 9.07805 0.493187 0.016993
thread10::elementwise_add_grad 1961 803.534 0.021306 8.71669 0.409757 0.0142712
thread10::lookup_table 1729 736.748 0.020991 6.00964 0.426112 0.0130851
thread10::sequence_pool 1247 712.559 0.215601 23.461 0.571418 0.0126554
thread10::sequence_pool_grad 1280 506.368 0.056144 5.84544 0.3956 0.00899337
thread10::concat_grad 942 492.118 0.120531 4.16522 0.522418 0.00874028
thread10::lookup_table_grad 1752 475.525 0.02363 6.50643 0.271418 0.00844559
thread10::tanh 1804 387.756 0.075507 0.61062 0.214942 0.00688676
thread10::tanh_grad 1836 299.045 0.055996 1.64046 0.162878 0.0053112
thread10::scale 638 276.89 0.058454 5.68637 0.433997 0.00491773
thread10::broadcast 112 209.949 0.17165 6.21955 1.87454 0.00372881
thread10::reduce 123 179.454 0.211193 15.9261 1.45897 0.00318719
thread10::cos_sim_grad 151 158.795 0.206391 8.27171 1.05162 0.00282028
thread10::read 308 84.8472 0.038914 1.55864 0.275478 0.00150693
thread10::auc 335 60.318 0.056311 0.955638 0.180054 0.00107128
thread10::cos_sim 151 51.4833 0.148379 1.88589 0.340949 0.000914373
thread10::split_selected_rows 6 47.7025 3.3271 12.2915 7.95042 0.000847223
thread10::cast 149 38.5146 0.016053 2.91507 0.258488 0.000684042
thread10::recv 120 33.007 0.013064 0.938893 0.275058 0.000586223
thread10::elementwise_sub 328 29.2496 0.015662 1.02153 0.0891758 0.00051949
thread10::elementwise_mul_grad 140 28.4628 0.027844 1.46832 0.203306 0.000505515
thread10::elementwise_sub_grad 159 16.1476 0.016939 1.02455 0.101558 0.000286791
thread10::mean 149 15.4905 0.011255 1.31164 0.103963 0.000275119
thread10::square_grad 151 15.0416 0.015333 0.576152 0.0996136 0.000267148
thread10::fill_constant 335 14.3254 0.005004 0.397451 0.0427624 0.000254427
thread10::fill_constant_batch_size_like 154 13.6294 0.010946 1.24553 0.0885026 0.000242066
thread10::mean_grad 156 13.5895 0.020444 0.974843 0.0871119 0.000241356
thread10::elementwise_mul 180 12.4656 0.017426 0.382112 0.0692535 0.000221396
thread10::square 135 9.55361 0.011617 0.557194 0.0707674 0.000169677
thread10::send 123 6.68475 0.012929 0.316239 0.0543476 0.000118725
thread10::create_double_buffer_reader 152 1.0503 0.002139 0.069017 0.00690989 1.8654e-05
thread10::split_byref 16 0.811218 0.026237 0.129718 0.0507011 1.44077e-05
thread9::fetch_barrier 4 45394 705.143 15411.9 11348.5 0.613797
thread9::sequence_conv_grad 1277 5035.46 1.65335 14.9289 3.94319 0.0680871
thread9::batch_norm_grad 630 3645.24 2.1917 17.6404 5.7861 0.0492893
thread9::mul_grad 643 3262.63 1.68102 24.4812 5.07407 0.0441157
thread9::batch_norm 638 2770.15 1.77411 13.8411 4.34193 0.0374567
thread9::sequence_conv 1265 2659.35 0.858854 24.8404 2.10226 0.0359586
thread9::mul 644 1855.66 0.790586 16.1129 2.88146 0.0250914
thread9::sum 936 1511.9 0.075685 9.80199 1.61528 0.0204433
thread9::concat 1122 1026.34 0.021173 26.8104 0.914742 0.0138777
thread9::elementwise_add 2056 983.16 0.014861 6.77585 0.478191 0.0132938
thread9::elementwise_add_grad 1977 803.739 0.020175 6.7102 0.406545 0.0108678
thread9::lookup_table 1756 750.176 0.024611 6.20442 0.427207 0.0101435
thread9::sequence_pool 1287 710.688 0.216884 23.5378 0.552205 0.0096096
thread9::lookup_table_grad 1770 465.588 0.023209 5.73702 0.263044 0.00629547
thread9::sequence_pool_grad 1250 464.589 0.05538 4.98275 0.371671 0.00628195
thread9::concat_grad 933 460.781 0.112279 3.77117 0.493871 0.00623047
thread9::tanh 1853 397.275 0.075082 0.654249 0.214396 0.00537177
thread9::tanh_grad 1907 309.451 0.063233 0.582961 0.162271 0.00418426
thread9::scale 646 296.676 0.065449 5.88007 0.45925 0.00401151
thread9::broadcast 123 203.166 0.156994 7.79936 1.65176 0.00274712
thread9::cos_sim_grad 160 158.428 0.199243 6.87161 0.990176 0.00214219
thread9::reduce 103 156.828 0.173291 23.1737 1.5226 0.00212056
thread9::split_selected_rows 12 125.177 2.77551 23.8891 10.4314 0.00169259
thread9::read 268 92.6759 0.044755 2.12933 0.345806 0.00125312
thread9::cos_sim 175 54.7653 0.14472 0.883277 0.312944 0.000740511
thread9::auc 289 50.1728 0.059346 1.11383 0.173608 0.000678414
thread9::send_barrier 2 49.9712 19.3172 30.654 24.9856 0.000675688
thread9::cast 175 42.9469 0.01203 2.63408 0.245411 0.000580708
thread9::recv 116 41.113 0.018141 0.967945 0.354423 0.000555912
thread9::elementwise_mul_grad 152 25.8188 0.028932 1.04823 0.169861 0.00034911
thread9::elementwise_sub 298 24.9781 0.013497 1.57411 0.0838192 0.000337743
thread9::elementwise_sub_grad 175 21.7677 0.015277 1.01042 0.124387 0.000294332
thread9::square_grad 164 16.6153 0.016859 0.830941 0.101313 0.000224665
thread9::fill_constant 335 15.9051 0.00612 0.720436 0.047478 0.000215062
thread9::mean_grad 142 14.9088 0.019454 0.770203 0.104991 0.00020159
thread9::mean 143 13.9365 0.009414 0.672979 0.0974583 0.000188443
thread9::elementwise_mul 165 13.0013 0.018435 0.727815 0.078796 0.000175798
thread9::square 142 12.9369 0.008229 0.664625 0.0911051 0.000174927
thread9::fill_constant_batch_size_like 150 8.35739 0.011463 0.333245 0.055716 0.000113005
thread9::send 127 8.26406 0.01099 0.401593 0.0650714 0.000111743
thread9::create_double_buffer_reader 146 0.940069 0.001758 0.051403 0.00643883 1.27112e-05
thread9::split_byref 13 0.506866 0.026818 0.057698 0.0389897 6.85361e-06
thread8::sequence_conv_grad 1301 5205.06 1.58755 18.2666 4.00081 0.176824
thread8::batch_norm_grad 626 3725.31 2.18648 22.9447 5.95097 0.126554
thread8::mul_grad 605 3073.84 1.64432 26.3762 5.08072 0.104423
thread8::batch_norm 626 2840.88 1.78994 15.6265 4.53814 0.0965088
thread8::sequence_conv 1207 2652.61 0.772446 23.959 2.19768 0.090113
thread8::mul 640 1767.07 0.796368 16.1711 2.76104 0.0600299
thread8::sum 905 1471.25 0.079453 9.65826 1.62569 0.0499806
thread8::concat 1091 1144.54 0.016723 31.7004 1.04907 0.0388817
thread8::elementwise_add 1964 943.317 0.015429 7.22925 0.480304 0.0320459
thread8::elementwise_add_grad 2016 877.401 0.023014 7.81769 0.435219 0.0298066
thread8::lookup_table 1766 767.377 0.024078 6.37176 0.434528 0.0260689
thread8::fetch_barrier 1 739.621 739.621 739.621 739.621 0.025126
thread8::sequence_pool 1274 719.277 0.214302 23.5274 0.564582 0.0244349
thread8::concat_grad 917 483.764 0.125003 4.60696 0.527551 0.0164342
thread8::sequence_pool_grad 1251 473.204 0.061339 5.74253 0.37826 0.0160754
thread8::lookup_table_grad 1619 425.47 0.022791 5.93199 0.262798 0.0144538
thread8::tanh 1811 391.202 0.078513 0.682612 0.216014 0.0132897
thread8::tanh_grad 1806 304.026 0.059536 2.72275 0.168342 0.0103282
thread8::scale 569 248.02 0.072389 4.04737 0.435888 0.00842563
thread8::send_barrier 4 239.001 21.2392 117.19 59.7503 0.00811923
thread8::broadcast 112 206.607 0.171346 7.94773 1.8447 0.00701874
thread8::cos_sim_grad 155 173.788 0.20167 7.10267 1.12121 0.00590383
thread8::reduce 98 85.2461 0.182759 10.1377 0.869858 0.00289594
thread8::read 324 83.3764 0.045863 1.36234 0.257335 0.00283242
thread8::auc 324 58.7427 0.045017 1.17139 0.181305 0.00199558
thread8::split_selected_rows 8 50.7198 3.2612 9.63971 6.33997 0.00172303
thread8::cos_sim 153 47.9601 0.143878 1.33116 0.313465 0.00162928
thread8::cast 156 37.1558 0.015545 2.38322 0.238178 0.00126224
thread8::recv 116 32.2342 0.016892 0.898582 0.277881 0.00109504
thread8::elementwise_sub 333 29.7448 0.014535 1.12614 0.0893236 0.00101047
thread8::elementwise_mul_grad 146 25.9007 0.030991 1.69174 0.177402 0.000879886
thread8::mean_grad 166 17.4795 0.016978 0.882141 0.105298 0.000593806
thread8::square_grad 176 15.9734 0.018324 0.849144 0.0907581 0.000542641
thread8::fill_constant_batch_size_like 162 13.0578 0.01306 0.601151 0.0806039 0.000443594
thread8::square 160 12.6767 0.012978 0.680612 0.0792294 0.000430647
thread8::fill_constant 293 12.0972 0.006192 0.461571 0.0412874 0.00041096
thread8::elementwise_mul 170 11.9332 0.016044 0.571521 0.0701953 0.000405389
thread8::elementwise_sub_grad 140 11.1762 0.015815 0.466704 0.0798301 0.000379673
thread8::mean 140 10.103 0.006696 0.535193 0.072164 0.000343213
thread8::send 102 6.3699 0.015137 0.475607 0.06245 0.000216395
thread8::create_double_buffer_reader 149 1.06017 0.001769 0.035788 0.00711526 3.60157e-05
thread8::split_byref 18 0.819699 0.02762 0.120458 0.0455388 2.78464e-05
thread7::fetch_barrier 3 44096.8 13420.1 15572.2 14698.9 0.605917
thread7::sequence_conv_grad 1314 4888.67 1.62081 21.8524 3.72045 0.0671733
thread7::batch_norm_grad 671 3509.46 2.18962 18.5318 5.2302 0.0482222
thread7::mul_grad 706 3425.51 1.63792 27.8536 4.85199 0.0470686
thread7::batch_norm 667 2718.2 1.76818 11.7029 4.07526 0.0373497
thread7::sequence_conv 1353 2671.31 0.826143 22.8835 1.97436 0.0367054
thread7::mul 706 1758.76 0.791725 15.2376 2.49117 0.0241665
thread7::sum 1057 1439.99 0.071566 9.22598 1.36234 0.0197863
thread7::concat 1120 1041.75 0.019332 16.0628 0.930138 0.0143143
thread7::elementwise_add 2191 1036.65 0.016752 7.22537 0.47314 0.0142442
thread7::elementwise_add_grad 2175 876.413 0.021103 9.96765 0.402949 0.0120425
thread7::lookup_table 1689 715.541 0.024653 5.94807 0.423648 0.00983197
thread7::sequence_pool 1294 714.91 0.210386 23.4828 0.55248 0.00982329
thread7::sequence_pool_grad 1378 524.019 0.054286 5.59645 0.380275 0.00720034
thread7::concat_grad 1024 496.354 0.110917 5.25344 0.484721 0.00682021
thread7::lookup_table_grad 1887 471.342 0.01907 6.03824 0.249784 0.00647653
thread7::tanh 2049 425.456 0.07619 0.658785 0.207641 0.00584602
thread7::tanh_grad 1990 314.857 0.063509 0.86146 0.15822 0.00432633
thread7::scale 639 314.81 0.07197 4.44925 0.49266 0.00432568
thread7::reduce 134 224.631 0.225641 18.0432 1.67635 0.00308656
thread7::broadcast 117 214.48 0.1547 5.36566 1.83316 0.00294708
thread7::send_barrier 4 176.583 26.8622 70.0072 44.1458 0.00242636
thread7::cos_sim_grad 156 175.374 0.176672 9.27743 1.12419 0.00240975
thread7::split_selected_rows 13 110.802 5.23098 13.4726 8.52321 0.00152248
thread7::read 308 86.0268 0.031585 1.89389 0.279308 0.00118206
thread7::cos_sim 173 55.4263 0.14366 0.903815 0.320383 0.000761591
thread7::auc 309 54.5062 0.057809 1.14647 0.176395 0.000748948
thread7::cast 150 37.3937 0.012154 2.47006 0.249291 0.000513812
thread7::recv 109 32.7459 0.016907 0.924388 0.300421 0.000449949
thread7::elementwise_sub 342 29.0007 0.01445 0.946714 0.0847974 0.000398487
thread7::elementwise_mul_grad 144 23.4808 0.028543 0.771677 0.163061 0.00032264
thread7::square_grad 172 15.9491 0.017878 1.57868 0.0927271 0.00021915
thread7::mean 150 14.0158 0.008844 1.43049 0.0934387 0.000192586
thread7::mean_grad 162 13.9955 0.021047 0.626142 0.0863921 0.000192307
thread7::square 162 13.5921 0.010093 1.42613 0.0839019 0.000186764
thread7::fill_constant 301 13.4386 0.005487 0.59923 0.0446464 0.000184654
thread7::fill_constant_batch_size_like 176 13.239 0.013404 1.08081 0.0752217 0.000181912
thread7::elementwise_sub_grad 135 12.7983 0.016871 1.00232 0.0948023 0.000175857
thread7::elementwise_mul 161 10.0885 0.018115 0.544687 0.0626613 0.000138622
thread7::send 120 6.49772 0.012681 0.400394 0.0541477 8.92827e-05
thread7::create_double_buffer_reader 162 1.23463 0.00179 0.052463 0.00762119 1.69646e-05
thread7::split_byref 21 0.874182 0.02118 0.089751 0.0416277 1.20118e-05
thread6::sequence_conv_grad 1274 5016.19 1.69912 21.5263 3.93735 0.175358
thread6::batch_norm_grad 689 3679.94 2.20029 17.6266 5.34099 0.128645
thread6::mul_grad 647 3181.16 1.66386 31.3188 4.91678 0.111208
thread6::batch_norm 660 2805.94 1.78986 11.1254 4.25142 0.0980911
thread6::sequence_conv 1327 2654.02 0.790426 23.3031 2.00002 0.0927804
thread6::mul 662 1825.52 0.796126 19.3243 2.75759 0.0638173
thread6::sum 917 1461.63 0.070714 9.32687 1.59392 0.0510961
thread6::concat 1146 1141.83 0.020326 26.8824 0.996364 0.0399166
thread6::elementwise_add 2136 983.769 0.018764 6.84459 0.460566 0.034391
thread6::elementwise_add_grad 2100 871.283 0.024368 9.12019 0.414897 0.0304586
thread6::lookup_table 1775 732.499 0.02663 7.51795 0.412675 0.025607
thread6::sequence_pool 1278 707.365 0.20992 23.5113 0.553494 0.0247283
thread6::sequence_pool_grad 1260 477.593 0.054759 5.43147 0.379042 0.0166959
thread6::concat_grad 979 475.517 0.111264 3.0443 0.485717 0.0166233
thread6::lookup_table_grad 1823 465.947 0.022302 11.9122 0.255594 0.0162888
thread6::tanh 1957 407.597 0.079513 0.676667 0.208277 0.0142489
thread6::tanh_grad 1966 317.761 0.064561 1.06132 0.161628 0.0111084
thread6::scale 664 275.391 0.064515 5.29579 0.414746 0.00962723
thread6::broadcast 107 196.822 0.296909 6.9091 1.83945 0.00688056
thread6::cos_sim_grad 156 159.725 0.188262 11.4127 1.02388 0.00558373
thread6::reduce 118 132.494 0.176529 16.5025 1.12283 0.00463178
thread6::split_selected_rows 13 114.103 2.75881 13.2467 8.77716 0.00398886
thread6::read 316 80.9112 0.037007 1.45663 0.256048 0.00282852
thread6::send_barrier 2 78.5746 23.9206 54.6539 39.2873 0.00274684
thread6::auc 362 61.1244 0.05714 1.38223 0.168852 0.00213681
thread6::cos_sim 158 49.5619 0.151043 1.06912 0.313683 0.0017326
thread6::cast 161 42.7501 0.01531 3.45233 0.265529 0.00149447
thread6::recv 110 31.2973 0.014093 0.900887 0.284521 0.0010941
thread6::elementwise_sub 346 30.0732 0.015333 0.800613 0.0869168 0.00105131
thread6::elementwise_mul_grad 157 29.1312 0.026985 1.13342 0.185549 0.00101838
thread6::square_grad 161 17.6438 0.018998 1.31782 0.109589 0.000616798
thread6::elementwise_sub_grad 154 16.6417 0.014022 2.03981 0.108063 0.000581767
thread6::mean_grad 145 15.1737 0.015765 1.16953 0.104646 0.000530448
thread6::fill_constant 297 13.9849 0.005605 0.463413 0.0470871 0.000488889
thread6::elementwise_mul 171 12.216 0.016608 0.611339 0.0714388 0.000427053
thread6::square 168 11.4379 0.01178 0.920068 0.068083 0.000399852
thread6::mean 138 10.2905 0.006592 0.765124 0.0745689 0.00035974
thread6::fill_constant_batch_size_like 159 9.70111 0.011959 0.385411 0.0610133 0.000339135
thread6::send 139 8.83712 0.011692 0.4439 0.0635764 0.000308931
thread6::create_double_buffer_reader 176 1.23829 0.001943 0.075787 0.00703576 4.32887e-05
thread6::split_byref 15 0.761707 0.02446 0.082714 0.0507805 2.6628e-05
thread5::fetch_barrier 3 16455.8 691.426 14673.5 5485.28 0.36337
thread5::sequence_conv_grad 1237 4937.97 1.61842 26.7579 3.99189 0.109038
thread5::batch_norm_grad 590 3724.12 2.18379 18.2984 6.31208 0.0822343
thread5::mul_grad 603 3267.95 1.67994 27.6365 5.41948 0.0721612
thread5::batch_norm 605 2895.62 1.90974 14.547 4.78615 0.0639396
thread5::sequence_conv 1226 2653.67 0.862593 23.6912 2.16449 0.058597
thread5::mul 596 1777.84 0.801573 21.5311 2.98295 0.0392573
thread5::sum 966 1472.6 0.066126 11.021 1.52443 0.0325172
thread5::concat 1169 1038.06 0.020236 26.8325 0.887993 0.022922
thread5::elementwise_add 2001 925.52 0.011574 5.71484 0.462529 0.0204369
thread5::elementwise_add_grad 1937 789.523 0.022682 10.3698 0.407601 0.0174338
thread5::lookup_table 1713 734.526 0.026364 9.46655 0.428795 0.0162194
thread5::sequence_pool 1236 704.767 0.223312 23.5786 0.5702 0.0155623
thread5::sequence_pool_grad 1307 511.414 0.060297 7.43059 0.391289 0.0112928
thread5::lookup_table_grad 1775 488.029 0.024588 8.83189 0.274946 0.0107764
thread5::concat_grad 900 464.808 0.120609 7.52225 0.516453 0.0102637
thread5::tanh 1822 400.045 0.081667 0.652092 0.219564 0.0088336
thread5::send_barrier 3 317.32 77.0701 140.63 105.773 0.00700691
thread5::tanh_grad 1892 305.28 0.066369 0.707227 0.161353 0.00674104
thread5::scale 620 259.838 0.071382 4.7153 0.419093 0.00573761
thread5::reduce 130 257.804 0.160515 21.9779 1.98311 0.0056927
thread5::broadcast 114 209.187 0.151482 4.41026 1.83497 0.00461915
thread5::cos_sim_grad 156 163.499 0.19912 9.24948 1.04807 0.0036103
thread5::read 278 96.4237 0.043374 1.61857 0.346848 0.00212918
thread5::split_selected_rows 9 75.5048 3.00244 14.0049 8.38942 0.00166726
thread5::auc 317 56.6127 0.058583 1.11129 0.178589 0.00125009
thread5::cast 181 51.7896 0.015039 2.56399 0.28613 0.00114359
thread5::cos_sim 136 44.156 0.145186 1.22603 0.324677 0.000975032
thread5::recv 110 31.8099 0.017351 0.986029 0.289181 0.000702411
thread5::elementwise_mul_grad 141 26.6379 0.022939 2.19058 0.188922 0.000588206
thread5::elementwise_sub 302 25.2029 0.01587 1.05107 0.0834535 0.000556519
thread5::mean_grad 186 19.264 0.019253 1.35656 0.10357 0.000425379
thread5::mean 169 15.5487 0.00866 2.09921 0.0920039 0.000343338
thread5::elementwise_sub_grad 145 15.3487 0.018084 1.47946 0.105853 0.000338922
thread5::square_grad 152 14.6755 0.017737 0.961119 0.0965495 0.000324058
thread5::fill_constant 310 13.4351 0.005678 0.443868 0.043339 0.000296667
thread5::square 174 12.874 0.010655 1.05055 0.0739884 0.000284277
thread5::fill_constant_batch_size_like 145 12.4461 0.013582 0.69394 0.0858348 0.000274828
thread5::elementwise_mul 146 10.2102 0.018759 0.600416 0.0699331 0.000225457
thread5::send 108 7.65606 0.013508 0.403571 0.0708894 0.000169057
thread5::create_double_buffer_reader 178 1.26606 0.002086 0.051236 0.0071127 2.79565e-05
thread5::split_byref 14 0.671764 0.029159 0.091992 0.0479831 1.48336e-05
thread4::fetch_barrier 6 45005.3 779.496 15068.6 7500.89 0.610506
thread4::sequence_conv_grad 1226 4915.68 1.57775 19.5195 4.00953 0.0666821
thread4::batch_norm_grad 627 3751.64 2.19081 18.5176 5.98348 0.0508918
thread4::mul_grad 604 3135.94 1.65237 27.2786 5.19196 0.0425397
thread4::batch_norm 623 2896.12 1.80976 13.0359 4.64866 0.0392863
thread4::sequence_conv 1251 2684.37 0.856778 24.7292 2.14578 0.0364139
thread4::mul 611 1757.85 0.796196 14.0086 2.87701 0.0238456
thread4::sum 927 1537.3 0.081091 8.60128 1.65836 0.0208538
thread4::concat 1120 1022.1 0.019827 24.784 0.912589 0.013865
thread4::elementwise_add 1978 925.726 0.013907 6.8075 0.468011 0.0125576
thread4::elementwise_add_grad 2129 857.374 0.025906 8.93441 0.402712 0.0116304
thread4::lookup_table 1762 735.375 0.022126 5.52929 0.417353 0.00997551
thread4::sequence_pool 1255 705.521 0.217575 23.5603 0.562169 0.00957053
thread4::concat_grad 975 487.999 0.116203 4.08285 0.500512 0.0066198
thread4::sequence_pool_grad 1305 481.749 0.057428 7.13429 0.369156 0.00653502
thread4::lookup_table_grad 1753 474.013 0.020694 9.24527 0.270401 0.00643008
thread4::tanh 1914 415.569 0.074146 0.646064 0.217121 0.00563727
thread4::tanh_grad 1934 316.161 0.062475 0.627872 0.163475 0.00428878
thread4::send_barrier 4 306.483 28.3505 96.4781 76.6208 0.0041575
thread4::scale 555 230.302 0.06489 6.68327 0.414959 0.00312409
thread4::broadcast 124 221.34 0.132456 7.9524 1.785 0.00300252
thread4::reduce 112 198.636 0.203311 28.4731 1.77353 0.00269453
thread4::cos_sim_grad 176 166.467 0.19225 9.87159 0.945835 0.00225816
thread4::read 332 86.8419 0.03549 1.43534 0.261572 0.00117803
thread4::split_selected_rows 7 62.3308 5.8013 12.6304 8.9044 0.000845529
thread4::auc 336 61.6384 0.051676 0.847169 0.183448 0.000836137
thread4::cos_sim 144 45.9037 0.150551 1.01853 0.318775 0.000622692
thread4::cast 157 38.061 0.01436 2.25206 0.242427 0.000516305
thread4::recv 127 32.1108 0.011583 1.07391 0.252841 0.000435589
thread4::elementwise_mul_grad 148 24.9885 0.028817 1.14045 0.168841 0.000338973
thread4::elementwise_sub 296 21.8619 0.015304 0.683005 0.0738576 0.00029656
thread4::square_grad 169 19.2263 0.01944 1.05134 0.113765 0.000260808
thread4::elementwise_sub_grad 151 15.3715 0.01584 1.16456 0.101798 0.000208518
thread4::fill_constant 343 14.612 0.004554 0.606437 0.0426005 0.000198214
thread4::mean_grad 137 13.3247 0.019492 0.649568 0.0972609 0.000180753
thread4::mean 166 12.6241 0.006911 0.532463 0.076049 0.000171249
thread4::fill_constant_batch_size_like 145 12.1787 0.011969 1.01887 0.0839912 0.000165207
thread4::elementwise_mul 134 8.96961 0.016795 0.718967 0.0669374 0.000121674
thread4::square 122 8.92471 0.009809 1.0685 0.0731534 0.000121065
thread4::send 105 8.51385 0.013831 0.669003 0.0810843 0.000115492
thread4::create_double_buffer_reader 170 1.06918 0.001913 0.064567 0.00628926 1.45036e-05
thread4::split_byref 13 0.509717 0.026731 0.052322 0.039209 6.91441e-06
thread3::fetch_barrier 3 16126.8 1027.06 13948 5375.6 0.360091
thread3::sequence_conv_grad 1283 4998.19 1.63697 18.7378 3.89571 0.111603
thread3::batch_norm_grad 633 3737.96 2.19388 19.8902 5.90515 0.0834639
thread3::mul_grad 645 3273.08 1.64824 36.1187 5.07454 0.0730837
thread3::batch_norm 625 2804.44 1.77273 14.3121 4.4871 0.0626196
thread3::sequence_conv 1258 2671.06 0.873218 23.3088 2.12326 0.0596414
thread3::mul 653 1845.6 0.792872 17.2394 2.82634 0.0412099
thread3::sum 881 1332.22 0.07705 11.7231 1.51217 0.0297469
thread3::concat 1141 1065.44 0.019129 26.9339 0.933774 0.0237899
thread3::elementwise_add 2024 946.873 0.016307 11.1211 0.467823 0.0211425
thread3::elementwise_add_grad 2030 859.024 0.020897 10.9512 0.423165 0.0191809
thread3::lookup_table 1746 763.35 0.022715 8.31356 0.437199 0.0170446
thread3::sequence_pool 1265 707.121 0.219961 23.6759 0.558989 0.0157891
thread3::concat_grad 927 467.381 0.114297 4.41937 0.504186 0.010436
thread3::sequence_pool_grad 1258 457.863 0.047415 6.07234 0.363961 0.0102235
thread3::lookup_table_grad 1710 455.806 0.023423 6.99781 0.266553 0.0101776
thread3::tanh 1840 394.456 0.079123 0.623015 0.214378 0.0088077
thread3::tanh_grad 1859 300.715 0.064753 0.619348 0.161762 0.00671458
thread3::reduce 129 280.761 0.213505 23.4755 2.17644 0.00626903
thread3::scale 597 266.151 0.067507 5.75145 0.445814 0.00594281
thread3::broadcast 112 197.45 0.149688 5.25541 1.76295 0.00440881
thread3::cos_sim_grad 161 176.614 0.217214 5.08321 1.09698 0.00394357
thread3::send_barrier 4 157.596 19.5997 63.3453 39.3989 0.00351891
thread3::read 292 88.6213 0.041031 1.11845 0.303498 0.0019788
thread3::split_selected_rows 8 60.6079 4.30495 14.3818 7.57599 0.0013533
thread3::auc 305 56.2668 0.063798 0.97987 0.184481 0.00125637
thread3::cos_sim 148 51.3628 0.150766 1.20545 0.347046 0.00114687
thread3::cast 131 35.8008 0.012187 4.74631 0.273289 0.000799388
thread3::recv 118 34.0804 0.017682 0.951278 0.288817 0.000760972
thread3::elementwise_sub 315 27.5182 0.017585 0.949066 0.0873595 0.000614447
thread3::elementwise_mul_grad 147 25.5643 0.028053 1.36328 0.173906 0.000570818
thread3::mean_grad 174 18.452 0.017932 1.32225 0.106046 0.00041201
thread3::square_grad 174 17.2508 0.013737 0.921739 0.0991426 0.000385189
thread3::fill_constant 317 13.9169 0.005746 0.436266 0.043902 0.000310748
thread3::elementwise_sub_grad 162 13.8559 0.016102 1.17051 0.08553 0.000309384
thread3::fill_constant_batch_size_like 145 13.8456 0.010824 0.910915 0.0954871 0.000309155
thread3::mean 157 13.8029 0.007901 1.02165 0.0879166 0.000308201
thread3::elementwise_mul 151 10.0779 0.017571 0.507239 0.066741 0.000225027
thread3::square 149 9.44395 0.011693 0.605248 0.0633822 0.000210871
thread3::send 104 6.91592 0.010207 0.38335 0.0664992 0.000154424
thread3::create_double_buffer_reader 167 1.11548 0.001697 0.03213 0.00667954 2.49073e-05
thread3::split_byref 21 0.873184 0.017677 0.086735 0.0415802 1.94971e-05
thread2::sequence_conv_grad 1233 4987.19 1.60932 21.6324 4.04476 0.173628
thread2::batch_norm_grad 628 3814.35 2.18203 20.0726 6.0738 0.132796
thread2::mul_grad 609 3181.56 1.65522 30.5159 5.22423 0.110765
thread2::batch_norm 629 2844.79 1.81178 12.4858 4.52272 0.0990411
thread2::sequence_conv 1205 2665.86 0.845448 23.2254 2.21233 0.0928114
thread2::mul 563 1838.72 0.792125 17.3844 3.26594 0.0640149
thread2::sum 896 1450.49 0.070377 7.98409 1.61885 0.0504987
thread2::concat 1112 1068.94 0.018571 27.9796 0.961276 0.0372149
thread2::elementwise_add 1959 936.189 0.01578 5.40256 0.477891 0.0325933
thread2::elementwise_add_grad 1991 820.345 0.019232 5.49352 0.412026 0.0285602
thread2::lookup_table 1773 733.559 0.020635 6.09997 0.413739 0.0255388
thread2::sequence_pool 1260 716.089 0.217793 23.5135 0.568325 0.0249305
thread2::concat_grad 901 458.076 0.107967 4.32036 0.508409 0.0159479
thread2::lookup_table_grad 1684 457.041 0.020155 6.92962 0.271402 0.0159118
thread2::sequence_pool_grad 1147 451.831 0.055839 6.4527 0.393924 0.0157304
thread2::tanh 1875 402.74 0.0818 0.67296 0.214794 0.0140213
thread2::tanh_grad 1903 305.981 0.055397 1.0031 0.160789 0.0106527
thread2::scale 614 280.607 0.067163 6.81103 0.457015 0.00976929
thread2::send_barrier 3 242.469 43.4689 118.924 80.8231 0.00844154
thread2::broadcast 111 202.038 0.175994 7.94383 1.82016 0.00703391
thread2::cos_sim_grad 156 178.64 0.196283 7.42524 1.14513 0.00621934
thread2::reduce 125 166.18 0.160538 16.3606 1.32944 0.00578554
thread2::read 292 93.7926 0.035045 1.68874 0.321207 0.00326538
thread2::split_selected_rows 6 61.8458 6.03627 14.0812 10.3076 0.00215315
thread2::auc 323 56.8357 0.051584 1.31947 0.175962 0.00197873
thread2::cos_sim 163 50.7105 0.150525 1.0188 0.311107 0.00176548
thread2::cast 175 46.6754 0.014429 3.46286 0.266717 0.001625
thread2::recv 114 39.3734 0.016857 0.980981 0.345381 0.00137078
thread2::elementwise_mul_grad 162 34.1649 0.026424 1.66564 0.210894 0.00118944
thread2::elementwise_sub 283 21.5788 0.012652 0.743158 0.0762502 0.000751264
thread2::elementwise_sub_grad 156 16.6028 0.013518 1.35983 0.106428 0.000578023
thread2::mean_grad 158 16.416 0.017877 1.7871 0.103899 0.000571521
thread2::square_grad 160 14.6364 0.017292 0.685256 0.0914774 0.000509564
thread2::mean 163 13.0237 0.008131 0.850193 0.0798999 0.000453418
thread2::square 165 12.8957 0.012124 1.33776 0.0781556 0.000448961
thread2::fill_constant 299 12.8692 0.005556 0.737359 0.0430407 0.000448039
thread2::fill_constant_batch_size_like 149 9.78944 0.014137 0.506359 0.0657009 0.000340818
thread2::elementwise_mul 145 9.42746 0.016499 0.713661 0.0650169 0.000328216
thread2::send 119 6.96275 0.00978 0.507824 0.0585105 0.000242407
thread2::create_double_buffer_reader 146 1.29025 0.001719 0.108989 0.00883736 4.492e-05
thread2::split_byref 15 0.789689 0.024626 0.102735 0.0526459 2.74929e-05
thread1::fetch_barrier 2 28710.5 13523.7 15186.8 14355.2 0.500549
thread1::sequence_conv_grad 1298 5049.28 1.67225 18.5906 3.89005 0.088031
thread1::batch_norm_grad 633 3603.08 2.1929 20.4361 5.69207 0.0628174
thread1::mul_grad 639 3219.43 1.665 25.368 5.03823 0.0561287
thread1::batch_norm 639 2752.27 1.79805 12.0923 4.30715 0.047984
thread1::sequence_conv 1295 2692.65 0.820025 23.2822 2.07927 0.0469447
thread1::mul 667 1860.42 0.789184 16.527 2.78923 0.0324352
thread1::sum 981 1439.59 0.069277 10.8553 1.46748 0.0250984
thread1::concat 1123 1047.95 0.02064 29.6785 0.933172 0.0182704
thread1::elementwise_add 2063 995.579 0.016068 7.57196 0.482588 0.0173573
thread1::elementwise_add_grad 2130 899.576 0.021646 10.8468 0.422336 0.0156835
thread1::lookup_table 1708 735.717 0.021545 6.21887 0.430747 0.0128267
thread1::sequence_pool 1267 705.242 0.213195 23.5736 0.556623 0.0122954
thread1::concat_grad 995 494.902 0.108628 3.56371 0.497389 0.00862831
thread1::lookup_table_grad 1907 492.522 0.02007 6.10832 0.258271 0.00858682
thread1::sequence_pool_grad 1330 471.118 0.062115 4.94855 0.354224 0.00821364
thread1::tanh 1946 416.831 0.07861 0.72679 0.214199 0.00726718
thread1::scale 614 305.592 0.07229 6.47611 0.497707 0.00532781
thread1::tanh_grad 1919 303.954 0.060346 0.673147 0.158392 0.00529924
thread1::broadcast 111 202.156 0.157805 5.60393 1.82123 0.00352447
thread1::reduce 109 194.586 0.22739 20.6498 1.78519 0.00339248
thread1::cos_sim_grad 153 162.757 0.205553 7.67509 1.06377 0.00283757
thread1::send_barrier 3 113.072 26.7542 50.7378 37.6905 0.00197133
thread1::read 332 83.9012 0.039914 1.76094 0.252714 0.00146276
thread1::split_selected_rows 6 60.9944 7.87294 16.6097 10.1657 0.0010634
thread1::auc 283 52.5627 0.049418 1.24951 0.185734 0.000916397
thread1::cos_sim 157 50.2024 0.142193 1.23465 0.319761 0.000875247
thread1::cast 150 33.2348 0.014635 2.93171 0.221565 0.000579427
thread1::recv 110 33.1227 0.016076 0.971415 0.301115 0.000577473
thread1::elementwise_sub 322 25.8503 0.012545 1.05623 0.0802804 0.000450684
thread1::elementwise_mul_grad 147 23.4623 0.032249 0.935644 0.159607 0.00040905
thread1::mean_grad 172 17.8072 0.017966 0.78329 0.10353 0.000310456
thread1::elementwise_sub_grad 170 17.6332 0.014866 0.801 0.103724 0.000307423
thread1::square_grad 159 15.6442 0.019057 0.705937 0.0983912 0.000272747
thread1::fill_constant 300 14.8799 0.006199 0.560092 0.0495998 0.000259422
thread1::elementwise_mul 187 14.2968 0.014804 0.736823 0.0764535 0.000249256
thread1::square 158 13.2274 0.010822 1.8082 0.0837179 0.000230612
thread1::mean 156 12.0512 0.007874 0.847896 0.0772516 0.000210106
thread1::fill_constant_batch_size_like 155 11.8357 0.012104 0.725338 0.0763593 0.000206348
thread1::send 111 6.26758 0.009864 0.352335 0.0564647 0.000109271
thread1::create_double_buffer_reader 149 1.19366 0.001408 0.053387 0.00801114 2.08107e-05
thread1::split_byref 23 1.05312 0.022435 0.117102 0.0457877 1.83604e-05
thread0::ScopeBufferedSSAGraphExecutorAfterRun 90 6600.67 52.3774 115.897 73.3408 0.623317
thread0::ThreadedSSAGraphExecutorPrepare 90 3988.92 34.4373 81.4597 44.3214 0.376683
The Profiling output from pserver.log is as follows:
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
sum 720 14603 0.398116 218.89 20.282 0.829053
scale 2160 100.124 0.006067 0.33895 0.0463536 0.00568429
adam 720 2910.96 0.05964 51.2808 4.04299 0.165263
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread
Event Calls Total Min. Max. Ave. Ratio.
thread56::sum 12 186.695 0.824641 149.836 15.5579 0.749089
thread56::scale 36 1.16934 0.006696 0.137374 0.0324816 0.0046918
thread56::adam 12 61.3652 0.245226 24.4686 5.11377 0.246219
thread55::sum 12 483.673 0.766154 162.293 40.3061 0.976418
thread55::scale 36 1.98347 0.006295 0.176165 0.0550963 0.00400414
thread55::adam 12 9.69802 0.200259 3.01724 0.808168 0.0195779
thread54::sum 12 159.376 0.948185 136.407 13.2813 0.730827
thread54::scale 36 1.56606 0.006574 0.134693 0.0435016 0.00718125
thread54::adam 12 57.134 0.160922 25.8742 4.76117 0.261992
thread53::sum 12 315.124 0.89058 162.892 26.2604 0.790483
thread53::scale 36 1.59852 0.006214 0.180902 0.0444034 0.00400986
thread53::adam 12 81.925 0.267202 28.0194 6.82709 0.205507
thread52::sum 12 32.6205 0.876013 9.20827 2.71838 0.273131
thread52::scale 36 1.60248 0.00795 0.133358 0.0445134 0.0134175
thread52::adam 12 85.2089 0.237518 27.8905 7.10074 0.713452
thread51::sum 12 330.08 0.561279 157.69 27.5067 0.89516
thread51::scale 36 1.40198 0.006239 0.107224 0.038944 0.0038021
thread51::adam 12 37.2566 0.198774 24.3433 3.10472 0.101038
thread50::sum 12 166.812 0.897113 140.615 13.901 0.922653
thread50::scale 36 1.60449 0.00639 0.102774 0.0445692 0.00887459
thread50::adam 12 12.3796 0.214678 5.11717 1.03164 0.0684729
thread49::sum 12 297.091 0.795978 140.713 24.7576 0.898283
thread49::scale 36 1.38741 0.006183 0.096483 0.0385391 0.00419496
thread49::adam 12 32.2536 0.142105 24.1988 2.6878 0.0975219
thread48::sum 13 156.685 0.918431 138.135 12.0527 0.597748
thread48::scale 39 1.80445 0.006908 0.145863 0.046268 0.0068839
thread48::adam 13 103.636 0.145078 26.4769 7.97203 0.395368
thread47::sum 13 291.431 0.7818 142.378 22.4178 0.970983
thread47::scale 39 2.30168 0.00631 0.187686 0.0590175 0.00766869
thread47::adam 13 6.4075 0.204372 1.40484 0.492884 0.0213483
thread46::sum 13 305.36 0.902332 138.543 23.4892 0.833553
thread46::scale 39 1.55396 0.006762 0.140672 0.0398451 0.0042419
thread46::adam 13 59.4216 0.207129 24.3573 4.57089 0.162205
thread45::sum 13 37.9585 1.04528 8.58173 2.91989 0.498705
thread45::scale 39 1.73666 0.008507 0.142952 0.0445298 0.0228165
thread45::adam 13 36.419 0.226543 24.6673 2.80146 0.478479
thread44::sum 13 201.441 0.758061 156.317 15.4955 0.827726
thread44::scale 39 1.67924 0.006662 0.177386 0.0430575 0.00690005
thread44::adam 13 40.2464 0.279416 24.491 3.09588 0.165374
thread43::sum 13 456.888 0.961683 165.465 35.1453 0.926578
thread43::scale 39 1.93931 0.0066 0.158156 0.049726 0.00393296
thread43::adam 13 34.2646 0.174004 24.2578 2.63574 0.0694892
thread42::sum 13 299.918 0.909198 137.527 23.0707 0.887942
thread42::scale 39 1.90426 0.006396 0.11771 0.0488273 0.00563778
thread42::adam 13 35.9455 0.20051 24.2599 2.76504 0.106421
thread41::sum 13 296.412 0.622487 137.236 22.801 0.778724
thread41::scale 39 1.67193 0.006507 0.133644 0.0428699 0.00439243
thread41::adam 13 82.5544 0.118687 25.5742 6.35034 0.216884
thread40::sum 13 330.701 0.913908 170.893 25.4385 0.859859
thread40::scale 39 2.31203 0.006914 0.150784 0.0592827 0.00601152
thread40::adam 13 51.5861 0.227056 42.2294 3.96816 0.13413
thread39::sum 13 300.715 0.917226 140.377 23.1319 0.831133
thread39::scale 39 1.76612 0.006513 0.211945 0.0452853 0.00488131
thread39::adam 13 59.3324 0.142193 26.5372 4.56403 0.163986
thread38::sum 13 153.146 0.889188 136.114 11.7805 0.730058
thread38::scale 39 2.13414 0.006529 0.175277 0.0547216 0.0101736
thread38::adam 13 54.4922 0.231962 25.2431 4.1917 0.259768
thread37::sum 13 349.404 0.682413 177.63 26.8772 0.954478
thread37::scale 39 1.73901 0.006741 0.184736 0.0445901 0.00475051
thread37::adam 13 14.9252 0.124565 3.0241 1.1481 0.0407717
thread36::sum 13 305.996 0.978411 146.479 23.5382 0.829802
thread36::scale 39 2.02339 0.007391 0.132695 0.0518818 0.00548705
thread36::adam 13 60.7384 0.127481 27.9429 4.67219 0.164711
thread35::sum 13 494.258 0.823066 162.477 38.0198 0.885893
thread35::scale 39 2.1029 0.006147 0.181214 0.0539206 0.00376918
thread35::adam 13 61.5597 0.156423 26.0769 4.73536 0.110338
thread34::sum 13 187.21 0.769729 163.458 14.4008 0.762016
thread34::scale 39 1.7792 0.006439 0.161436 0.0456206 0.00724202
thread34::adam 13 56.6883 0.23822 24.5173 4.36064 0.230742
thread33::sum 13 54.5418 1.16457 8.6478 4.19552 0.436209
thread33::scale 39 1.43874 0.007853 0.152865 0.0368907 0.0115066
thread33::adam 13 69.0554 0.188671 27.7681 5.31195 0.552284
thread32::sum 13 51.0084 0.999526 8.72337 3.92372 0.533508
thread32::scale 39 1.84941 0.008416 0.156523 0.0474208 0.0193434
thread32::adam 13 42.7517 0.289186 24.4924 3.28859 0.447149
thread31::sum 13 315.462 1.00572 153.417 24.2663 0.823851
thread31::scale 39 1.79913 0.006493 0.129017 0.0461317 0.00469856
thread31::adam 13 65.6505 0.162062 33.1232 5.05004 0.171451
thread30::sum 13 667.29 0.973816 218.89 51.33 0.949844
thread30::scale 39 1.83992 0.006423 0.134325 0.0471775 0.00261901
thread30::adam 13 33.3957 0.11876 25.4171 2.5689 0.0475366
thread29::sum 13 24.8553 0.691669 8.19566 1.91195 0.303589
thread29::scale 39 1.94978 0.0092 0.11974 0.0499943 0.023815
thread29::adam 13 55.0665 0.095966 24.4508 4.23588 0.672596
thread28::sum 13 47.6725 1.01185 17.2035 3.66712 0.432911
thread28::scale 39 2.03644 0.008328 0.200611 0.0522165 0.0184928
thread28::adam 13 60.4118 0.156938 24.3182 4.64706 0.548596
thread27::sum 13 298.984 0.91568 137.445 22.9988 0.777921
thread27::scale 39 1.48761 0.006456 0.127339 0.0381438 0.00387058
thread27::adam 13 83.8658 0.289919 27.1467 6.45122 0.218209
thread26::sum 13 308.024 0.398116 141.417 23.6942 0.953681
thread26::scale 39 2.10961 0.006736 0.190414 0.0540924 0.0065316
thread26::adam 13 12.8508 0.127671 2.89404 0.988524 0.0397877
thread25::sum 13 426.326 1.01315 137.71 32.7943 0.872783
thread25::scale 39 1.97765 0.006067 0.159565 0.050709 0.00404869
thread25::adam 13 60.1639 0.05964 26.7858 4.628 0.123169
thread24::sum 13 155.746 0.816433 135.003 11.9805 0.934608
thread24::scale 39 2.76032 0.006362 0.281195 0.0707775 0.0165642
thread24::adam 13 8.1368 0.255332 2.84823 0.625908 0.0488276
thread23::sum 13 25.8642 0.874502 8.0217 1.98955 0.310776
thread23::scale 39 2.35844 0.007685 0.33895 0.0604728 0.0283383
thread23::adam 13 55.0019 0.134988 24.7021 4.23091 0.660886
thread22::sum 13 301.314 0.880579 143.149 23.178 0.959843
thread22::scale 39 2.66436 0.006628 0.190116 0.068317 0.00848738
thread22::adam 13 9.9416 0.290237 3.34103 0.764738 0.0316692
thread21::sum 13 34.9364 0.962374 8.35851 2.68742 0.295531
thread21::scale 39 1.55142 0.008527 0.112243 0.0397799 0.0131236
thread21::adam 13 81.7278 0.231207 24.3374 6.28675 0.691345
thread20::sum 13 446.042 1.01206 157.307 34.3109 0.836679
thread20::scale 39 1.90418 0.006581 0.130494 0.048825 0.00357183
thread20::adam 13 85.164 0.235525 26.8123 6.55108 0.159749
thread19::sum 13 341.439 0.837449 163.682 26.2645 0.837438
thread19::scale 39 1.47895 0.006154 0.199478 0.0379217 0.00362738
thread19::adam 13 64.8007 0.151909 24.4713 4.98467 0.158935
thread18::sum 13 456.053 0.796035 161.834 35.081 0.926317
thread18::scale 39 1.97224 0.006071 0.293454 0.0505702 0.00400593
thread18::adam 13 34.3038 0.131301 24.725 2.63876 0.0696766
thread17::sum 13 318.894 0.996461 142.069 24.5303 0.827401
thread17::scale 39 1.95817 0.006377 0.254958 0.0502094 0.00508066
thread17::adam 13 64.5644 0.241266 24.6702 4.9665 0.167519
thread16::sum 13 291.522 0.683259 140.136 22.4248 0.836797
thread16::scale 39 1.43785 0.006556 0.084183 0.0368679 0.00412725
thread16::adam 13 55.4187 0.12446 26.1774 4.26298 0.159076
thread15::sum 13 303.415 0.757413 136.836 23.3396 0.885039
thread15::scale 39 1.65926 0.006323 0.14057 0.042545 0.00483993
thread15::adam 13 37.7523 0.103088 24.5471 2.90403 0.110121
thread14::sum 13 165.77 0.687292 135.521 12.7515 0.599042
thread14::scale 39 1.42776 0.006194 0.157708 0.0366093 0.0051595
thread14::adam 13 109.527 0.222062 51.2808 8.42519 0.395799
thread13::sum 13 23.8163 0.846704 7.99499 1.83203 0.260574
thread13::scale 39 1.74557 0.008465 0.157696 0.0447583 0.0190983
thread13::adam 13 65.8377 0.074946 33.6993 5.06444 0.720328
thread12::sum 13 44.455 0.908589 8.0047 3.41961 0.39983
thread12::scale 39 1.77746 0.00865 0.260496 0.0455759 0.0159865
thread12::adam 13 64.9523 0.207097 26.8685 4.99633 0.584184
thread11::sum 13 162.658 0.595326 135.515 12.5121 0.576125
thread11::scale 39 1.49146 0.006815 0.098827 0.0382424 0.00528266
thread11::adam 13 118.181 0.077595 46.4188 9.09087 0.418592
thread10::sum 13 476.553 0.719884 183.813 36.6579 0.977591
thread10::scale 39 1.74765 0.006167 0.121551 0.0448116 0.0035851
thread10::adam 13 9.17602 0.081131 2.86736 0.705848 0.0188235
thread9::sum 13 599.527 0.792357 155.704 46.1174 0.975547
thread9::scale 39 1.51719 0.00653 0.129496 0.0389022 0.00246876
thread9::adam 13 13.5103 0.066592 3.06828 1.03926 0.0219839
thread8::sum 13 296.805 0.828923 137.591 22.8311 0.823932
thread8::scale 39 1.87482 0.006359 0.142041 0.0480724 0.00520452
thread8::adam 13 61.5501 0.12595 27.2032 4.73462 0.170864
thread7::sum 13 443.712 0.647475 148.035 34.1317 0.978329
thread7::scale 39 1.98795 0.006271 0.1448 0.050973 0.00438318
thread7::adam 13 7.84051 0.119144 2.07625 0.603116 0.0172873
thread6::sum 13 193.585 0.846455 150.551 14.8912 0.74553
thread6::scale 39 1.43208 0.00709 0.144018 0.03672 0.00551519
thread6::adam 13 64.6441 0.136264 24.7204 4.97262 0.248955
thread5::sum 13 200.474 0.884548 169.473 15.4211 0.682018
thread5::scale 39 1.58189 0.006339 0.139907 0.0405613 0.00538162
thread5::adam 13 91.8868 0.166478 34.106 7.06822 0.312601
thread4::sum 13 36.4495 0.819802 9.13764 2.80381 0.250944
thread4::scale 39 1.40816 0.008029 0.127165 0.0361066 0.00969473
thread4::adam 13 107.392 0.092883 26.2525 8.26092 0.739361
thread3::sum 13 348.998 0.564498 179.528 26.846 0.964135
thread3::scale 39 1.74711 0.006443 0.115286 0.0447976 0.00482651
thread3::adam 13 11.2355 0.118935 2.87944 0.864267 0.0310388
thread2::sum 13 167.072 0.824552 138.196 12.8517 0.819188
thread2::scale 39 1.78716 0.006434 0.171067 0.0458245 0.00876281
thread2::adam 13 35.0891 0.06608 25.2633 2.69916 0.172049
thread1::sum 13 434.773 0.930336 139.571 33.4441 0.922999
thread1::scale 39 1.602 0.006155 0.204693 0.0410768 0.00340095
thread1::adam 13 34.6689 0.247305 24.6821 2.66684 0.0736002
I can share the code privately over Hi: caowei07
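Since the per-thread reports above are long, a small script can help rank events by total time across threads. The following is only a minimal sketch, not part of the original training code: the function name `aggregate_profile` is hypothetical, and it assumes each data row has the seven whitespace-separated columns shown in the report header (Event, Calls, Total, Min., Max., Ave., Ratio.), with an optional `threadN::` prefix on the event name.

```python
from collections import defaultdict


def aggregate_profile(lines):
    """Sum the Total column (ms) per event name across all threads.

    Hypothetical helper: assumes rows look like
    'thread1::fetch_barrier 2 28710.5 13523.7 15186.8 14355.2 0.500549'.
    """
    totals = defaultdict(float)
    for line in lines:
        parts = line.split()
        # A data row has exactly 7 fields; skip notes, titles and blank lines.
        if len(parts) != 7:
            continue
        # Drop the 'threadN::' prefix, if present, to merge threads.
        event = parts[0].split("::")[-1]
        try:
            total_ms = float(parts[2])
        except ValueError:
            # The column header row also has 7 fields but is not numeric.
            continue
        totals[event] += total_ms
    # Largest total time first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Three rows copied from the trainer report above, plus the header row.
sample = [
    "Event Calls Total Min. Max. Ave. Ratio.",
    "thread1::fetch_barrier 2 28710.5 13523.7 15186.8 14355.2 0.500549",
    "thread1::sequence_conv_grad 1298 5049.28 1.67225 18.5906 3.89005 0.088031",
    "thread1::batch_norm_grad 633 3603.08 2.1929 20.4361 5.69207 0.0628174",
]
ranked = aggregate_profile(sample)
for event, total_ms in ranked:
    print(f"{event}: {total_ms:.1f} ms")
```

Running it over the whole train.log (for example `aggregate_profile(open("train.log"))`) would give one merged ranking. In the thread1 rows above, `fetch_barrier` alone has a ratio of 0.500549, so it is the first entry worth looking at.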