PaddlePaddle / Paddle / Issue #15553

Opened Jan 28, 2019 by saxon_zh (@saxon_zh, Guest)

MPI training is still very slow, even though py_reader is already used to read the data

Created by: 333caowei

DSSM model. Training speed: Batch time: 15.881043, sample_per_second: 32.239696, py_reader_queue_size: 188
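As a quick sanity check on the log line above, the two speed figures can be multiplied to estimate how many samples one batch contains. This is only a sketch: it assumes that the batch time is in seconds and that sample_per_second is computed as batch size divided by batch time, neither of which the log states. The variable names are illustrative.

```python
# Values copied from the log line above.
batch_time_s = 15.881043          # assumed to be seconds per batch
samples_per_second = 32.239696    # assumed to be batch_size / batch_time

# Under those assumptions, samples per batch = time per batch * samples per second.
implied_batch_size = batch_time_s * samples_per_second
print(f"implied samples per batch: {implied_batch_size:.2f}")
```

If the assumptions hold, each batch of roughly that many samples takes about 16 seconds end to end.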

I can't tell from the profiling output what the bottleneck is.

The profiling output from train.log is below:

------------------------->     Profiling Report     <-------------------------

Note! This Report merge all thread info into one.
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                    Calls       Total       Min.        Max.        Ave.        Ratio.      
fetch_barrier                            90          761229      202.291     17143       8458.1      0.450949    
sequence_conv_grad                       40320       160199      1.54895     27.9581     3.97318     0.0949011   
batch_norm_grad                          20160       117971      2.17369     26.3294     5.85176     0.0698859   
mul_grad                                 20160       102925      1.62898     41.2276     5.10542     0.0609725   
batch_norm                               20160       89984.8     1.76818     17.7321     4.46353     0.0533067   
sequence_conv                            40320       85197.9     0.772446    25.6383     2.11304     0.0504709   
mul                                      20160       57996.2     0.78887     21.5311     2.8768      0.0343568   
sum                                      30240       47220.6     0.060923    11.9747     1.56153     0.0279733   
concat                                   36090       33670.3     0.016723    44.3501     0.932953    0.0199462   
elementwise_add                          65520       30987.1     0.011574    22.7972     0.472941    0.0183566   
elementwise_add_grad                     65520       27419.4     0.019232    10.9512     0.41849     0.0162432   
lookup_table                             55440       23649.9     0.020133    9.46655     0.426585    0.0140101   
sequence_pool                            40320       22695.6     0.207364    23.9766     0.562887    0.0134448   
concat_grad                              30240       15346.1     0.10278     7.52225     0.507476    0.00909096  
sequence_pool_grad                       40320       15308.8     0.045362    8.04285     0.379684    0.0090689   
lookup_table_grad                        55440       14739.7     0.01907     11.9122     0.265867    0.00873172  
tanh                                     60480       12936.1     0.072926    1.33797     0.213891    0.0076633   
tanh_grad                                60480       9817.2      0.053188    3.17885     0.162321    0.00581567  
scale                                    20160       8744.07     0.057911    7.86351     0.433734    0.00517995  
broadcast                                3690        6662.5      0.132456    7.95325     1.80556     0.00394684  
ScopeBufferedSSAGraphExecutorAfterRun    90          6600.67     52.3774     115.897     73.3408     0.00391021  
reduce                                   3690        5894.4      0.145929    28.4731     1.5974      0.00349182  
send_barrier                             90          5259.27     15.6305     321.357     58.4363     0.00311557  
cos_sim_grad                             5040        5190.01     0.176672    13.8599     1.02976     0.00307454  
ThreadedSSAGraphExecutorPrepare          90          3988.92     34.4373     81.4597     44.3214     0.00236302  
read                                     10080       2794.7      0.031585    2.85077     0.277252    0.00165557  
split_selected_rows                      270         2273.31     2.69046     25.6101     8.41965     0.0013467   
auc                                      10080       1817.73     0.045017    1.66554     0.180331    0.00107682  
cos_sim                                  5040        1610.53     0.142193    2.27312     0.319549    0.000954071 
cast                                     5040        1311.34     0.011168    5.96485     0.260187    0.000776836 
recv                                     3690        1088.71     0.010092    1.07391     0.295043    0.000644948 
elementwise_mul_grad                     5040        880.773     0.022939    2.36265     0.174756    0.000521766 
elementwise_sub                          10080       835.183     0.012162    1.57411     0.0828554   0.000494759 
square_grad                              5040        502.86      0.013737    1.68386     0.0997738   0.000297892 
mean_grad                                5040        493.471     0.015765    1.7871      0.097911    0.000292331 
elementwise_sub_grad                     5040        482.207     0.011793    2.03981     0.0956761   0.000285658 
fill_constant                            10080       453.586     0.004554    0.865451    0.0449986   0.000268703 
mean                                     5040        424.447     0.006344    2.09921     0.0842158   0.000251441 
square                                   5040        405.676     0.008229    1.8082      0.0804913   0.000240321 
fill_constant_batch_size_like            5040        381.764     0.010112    1.40453     0.0757468   0.000226156 
elementwise_mul                          5040        369.702     0.014741    1.24299     0.0733535   0.00021901  
send                                     3690        238.802     0.008305    4.58592     0.064716    0.000141465 
create_double_buffer_reader              5040        35.2148     0.001253    0.123469    0.00698707  2.08611e-05 
split_byref                              540         25.1774     0.017677    0.141838    0.0466247   1.4915e-05  
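
One way to read a report like the one above is to parse its rows and add up the Ratio column per category. The sketch below does that for a few rows copied from the merged report. The category sets (`COMM_EVENTS`, `READ_EVENTS`) and the helper names are hypothetical and chosen from the op names alone; which ops really count as communication or data reading is an assumption, not something the report states.

```python
# A few rows copied verbatim from the merged report above
# (columns: Event, Calls, Total, Min., Max., Ave., Ratio.).
REPORT = """\
Event                                    Calls       Total       Min.        Max.        Ave.        Ratio.
fetch_barrier                            90          761229      202.291     17143       8458.1      0.450949
sequence_conv_grad                       40320       160199      1.54895     27.9581     3.97318     0.0949011
batch_norm_grad                          20160       117971      2.17369     26.3294     5.85176     0.0698859
send_barrier                             90          5259.27     15.6305     321.357     58.4363     0.00311557
read                                     10080       2794.7      0.031585    2.85077     0.277252    0.00165557
recv                                     3690        1088.71     0.010092    1.07391     0.295043    0.000644948
send                                     3690        238.802     0.008305    4.58592     0.064716    0.000141465
"""

# Assumption: grouping by op name; adjust these sets as needed.
COMM_EVENTS = {"fetch_barrier", "send_barrier", "send", "recv"}
READ_EVENTS = {"read", "create_double_buffer_reader"}


def parse_rows(text):
    """Return (event, calls, total_ms, ratio) per data row; skip other lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        try:
            calls = int(parts[1])
            total_ms = float(parts[2])
            ratio = float(parts[6])
        except ValueError:
            continue  # header line or any other non-numeric row
        rows.append((parts[0], calls, total_ms, ratio))
    return rows


def group_shares(rows):
    """Sum the Ratio column per category."""
    shares = {"communication": 0.0, "data_reading": 0.0, "compute_and_other": 0.0}
    for event, _calls, _total_ms, ratio in rows:
        name = event.split("::")[-1]  # drop a "threadNN::" prefix if present
        if name in COMM_EVENTS:
            shares["communication"] += ratio
        elif name in READ_EVENTS:
            shares["data_reading"] += ratio
        else:
            shares["compute_and_other"] += ratio
    return shares


rows = parse_rows(REPORT)
shares = group_shares(rows)
for group, share in shares.items():
    print(f"{group}: {share:.2%}")
```

The shares printed here cover only the sample rows. Pasting the full merged report into `REPORT` gives a breakdown over every event; the per-thread report can be parsed the same way, but its ratios are relative to each thread and should be summed per thread rather than across threads.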


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread

Event                                             Calls       Total       Min.        Max.        Ave.        Ratio.      
thread32::fetch_barrier                           4           33112.4     997.959     16221.7     8278.1      0.536985    
thread32::sequence_conv_grad                      1250        5157.24     1.65616     19.6345     4.12579     0.0836351   
thread32::batch_norm_grad                         597         3778.16     2.17785     23.8654     6.32857     0.0612705   
thread32::mul_grad                                635         3190.36     1.65257     30.5185     5.0242      0.0517383   
thread32::batch_norm                              574         2824.86     1.77967     11.6922     4.92136     0.0458109   
thread32::sequence_conv                           1172        2609.79     0.81816     12.2794     2.22678     0.0423231   
thread32::mul                                     571         1763.62     0.7956      16.0639     3.08866     0.0286007   
thread32::sum                                     953         1419.51     0.074565    6.76241     1.48952     0.0230203   
thread32::concat                                  1149        1075.02     0.019526    27.7234     0.935611    0.0174336   
thread32::elementwise_add                         2008        1016.74     0.016663    22.7972     0.506347    0.0164886   
thread32::elementwise_add_grad                    2013        814.618     0.019352    9.29175     0.404679    0.0132107   
thread32::lookup_table                            1833        748.177     0.023729    5.47994     0.408171    0.0121332   
thread32::sequence_pool                           1238        711.355     0.219503    23.6129     0.5746      0.0115361   
thread32::concat_grad                             928         486.776     0.116789    4.79188     0.524543    0.00789407  
thread32::lookup_table_grad                       1724        432.322     0.023131    4.43831     0.250767    0.00701098  
thread32::sequence_pool_grad                      1202        427.692     0.057404    5.59981     0.355817    0.0069359   
thread32::tanh                                    1717        388.454     0.075122    0.692794    0.22624     0.00629957  
thread32::tanh_grad                               1882        310.839     0.068421    3.16892     0.165164    0.00504089  
thread32::scale                                   657         269.238     0.068524    7.81513     0.409799    0.00436625  
thread32::broadcast                               109         203.082     0.158403    7.9319      1.86314     0.00329339  
thread32::reduce                                  103         182.949     0.20158     16.3275     1.7762      0.00296689  
thread32::cos_sim_grad                            159         158.318     0.201341    8.285       0.995708    0.00256744  
thread32::read                                    312         85.0389     0.033594    1.14405     0.27256     0.00137908  
thread32::cos_sim                                 206         67.6419     0.153031    1.09429     0.328359    0.00109695  
thread32::split_selected_rows                     7           63.759      2.82136     15.5368     9.10842     0.00103398  
thread32::send_barrier                            2           61.1851     21.1857     39.9994     30.5926     0.000992241 
thread32::cast                                    183         50.5685     0.014297    5.96485     0.27633     0.000820071 
thread32::auc                                     257         45.1107     0.060827    1.24958     0.175528    0.000731562 
thread32::recv                                    111         32.157      0.017725    0.95414     0.289702    0.000521491 
thread32::elementwise_sub                         348         28.1236     0.014441    1.04099     0.0808149   0.000456082 
thread32::elementwise_mul_grad                    165         27.3382     0.028309    2.08034     0.165686    0.000443345 
thread32::square_grad                             153         15.013      0.020982    0.712499    0.098124    0.000243466 
thread32::mean_grad                               162         14.91       0.018528    0.533497    0.0920369   0.000241796 
thread32::elementwise_sub_grad                    163         14.4693     0.014491    0.607762    0.088769    0.00023465  
thread32::fill_constant_batch_size_like           164         14.4182     0.010354    1.40453     0.087916    0.000233821 
thread32::fill_constant                           322         14.1305     0.00535     0.573844    0.0438835   0.000229155 
thread32::square                                  171         12.9158     0.011128    0.520051    0.0755311   0.000209456 
thread32::elementwise_mul                         135         12.4509     0.015132    1.12853     0.092229    0.000201917 
thread32::mean                                    168         10.6736     0.007934    0.682096    0.0635334   0.000173094 
thread32::send                                    134         10.3941     0.011007    0.610125    0.0775683   0.000168562 
thread32::create_double_buffer_reader             144         0.985273    0.001939    0.043271    0.00684217  1.59782e-05 
thread32::split_byref                             16          0.702879    0.024581    0.08162     0.0439299   1.13986e-05 
thread31::fetch_barrier                           2           29707.2     13954.6     15752.5     14853.6     0.510473    
thread31::sequence_conv_grad                      1321        5029.36     1.61402     18.1841     3.80724     0.0864221   
thread31::batch_norm_grad                         661         3737.72     2.19651     17.5022     5.65465     0.0642272   
thread31::mul_grad                                604         3110.82     1.65088     30.4991     5.15036     0.0534547   
thread31::batch_norm                              640         2850.89     1.7809      15.0088     4.45451     0.0489882   
thread31::sequence_conv                           1252        2635.87     0.804854    19.8472     2.10533     0.0452935   
thread31::mul                                     642         1845.88     0.793949    16.7166     2.8752      0.0317186   
thread31::sum                                     1006        1470.65     0.073308    8.40091     1.46188     0.025271    
thread31::concat                                  1171        998.557     0.020134    15.9977     0.852739    0.0171587   
thread31::elementwise_add                         2017        928.144     0.018199    22.7931     0.460161    0.0159488   
thread31::elementwise_add_grad                    2062        846.29      0.02439     9.41433     0.410422    0.0145422   
thread31::lookup_table                            1753        743.682     0.023555    6.47188     0.424234    0.0127791   
thread31::sequence_pool                           1278        710.93      0.217464    23.5659     0.556283    0.0122163   
thread31::sequence_pool_grad                      1234        482.803     0.057885    5.11503     0.391251    0.00829625  
thread31::concat_grad                             975         481.379     0.114148    4.63596     0.493722    0.00827178  
thread31::lookup_table_grad                       1801        474.612     0.021095    5.36527     0.263527    0.0081555   
thread31::tanh                                    1905        400.802     0.074705    0.641755    0.210395    0.00688719  
thread31::tanh_grad                               1891        310.317     0.063158    0.580271    0.164102    0.00533234  
thread31::scale                                   598         261.63      0.072468    6.32622     0.437508    0.00449572  
thread31::broadcast                               115         213.454     0.144017    7.52968     1.85612     0.00366789  
thread31::reduce                                  114         188.056     0.240406    16.0613     1.64962     0.00323147  
thread31::cos_sim_grad                            160         168.432     0.18633     5.78947     1.0527      0.00289426  
thread31::split_selected_rows                     12          104.382     2.71594     15.3531     8.69848     0.00179365  
thread31::read                                    336         89.1137     0.038166    1.08559     0.265219    0.00153129  
thread31::send_barrier                            2           55.4628     23.7659     31.6969     27.7314     0.000953046 
thread31::auc                                     289         54.4152     0.058296    1.0178      0.188288    0.000935045 
thread31::cos_sim                                 157         47.1081     0.154933    0.770324    0.300052    0.000809483 
thread31::cast                                    149         37.7969     0.014777    2.52957     0.25367     0.000649483 
thread31::recv                                    121         35.4511     0.01536     1.0179      0.292984    0.000609174 
thread31::elementwise_mul_grad                    165         33.5809     0.028159    1.22163     0.20352     0.000577037 
thread31::elementwise_sub                         315         21.4376     0.015954    0.569542    0.0680557   0.000368372 
thread31::mean_grad                               135         16.1359     0.021089    1.33524     0.119525    0.000277271 
thread31::fill_constant                           327         15.5924     0.005519    0.781848    0.0476832   0.000267932 
thread31::square_grad                             136         15.5098     0.01714     0.784029    0.114043    0.000266513 
thread31::square                                  173         13.7716     0.012742    0.743037    0.0796045   0.000236644 
thread31::mean                                    166         13.7078     0.00865     0.770362    0.0825773   0.000235549 
thread31::elementwise_mul                         170         12.0182     0.017927    0.577616    0.0706953   0.000206515 
thread31::fill_constant_batch_size_like           146         11.6476     0.011157    0.735868    0.079778    0.000200147 
thread31::elementwise_sub_grad                    137         11.5625     0.015112    0.534516    0.0843981   0.000198685 
thread31::send                                    116         7.34807     0.010338    0.342645    0.0633455   0.000126266 
thread31::create_double_buffer_reader             181         1.38442     0.001959    0.05295     0.00764875  2.37893e-05 
thread31::split_byref                             12          0.464637    0.023763    0.079217    0.0387197   7.98409e-06 
thread30::fetch_barrier                           6           32927.3     768.03      16241.7     5487.89     0.535654    
thread30::sequence_conv_grad                      1260        5117.78     1.65352     23.1999     4.06173     0.083255    
thread30::batch_norm_grad                         612         3779.68     2.18197     18.595      6.17594     0.0614869   
thread30::mul_grad                                620         3198.24     1.64517     23.467      5.15845     0.0520282   
thread30::batch_norm                              639         2865.91     1.77916     11.9075     4.485       0.046622    
thread30::sequence_conv                           1224        2653.58     0.788904    23.9        2.16796     0.0431679   
thread30::mul                                     637         1829.54     0.790907    15.263      2.87212     0.0297626   
thread30::sum                                     928         1435.44     0.08474     8.8055      1.54681     0.0233514   
thread30::concat                                  1168        1015.29     0.019718    29.8214     0.869259    0.0165166   
thread30::elementwise_add                         2096        952.525     0.012715    9.00628     0.454449    0.0154955   
thread30::elementwise_add_grad                    1988        802.263     0.020041    7.57228     0.403553    0.013051    
thread30::lookup_table                            1650        736.848     0.021951    6.31711     0.446575    0.0119869   
thread30::sequence_pool                           1256        712.279     0.21458     23.5894     0.567101    0.0115872   
thread30::concat_grad                             910         458.561     0.114077    4.9729      0.503914    0.00745977  
thread30::lookup_table_grad                       1737        443.367     0.020457    7.81215     0.255249    0.00721259  
thread30::sequence_pool_grad                      1212        435.615     0.051663    5.57388     0.359419    0.00708649  
thread30::tanh                                    1994        427.126     0.079966    0.645406    0.214206    0.00694839  
thread30::tanh_grad                               1791        290.28      0.060298    1.0224      0.162077    0.00472221  
thread30::scale                                   648         244.869     0.070999    6.07203     0.377884    0.00398347  
thread30::broadcast                               105         194.755     0.145982    7.3481      1.85481     0.00316823  
thread30::reduce                                  112         179.287     0.211917    15.6643     1.60078     0.0029166   
thread30::cos_sim_grad                            158         152.698     0.206023    7.29127     0.966441    0.00248405  
thread30::send_barrier                            2           111.308     54.8178     56.4898     55.6538     0.00181073  
thread30::read                                    342         86.8546     0.040959    1.93822     0.253961    0.00141293  
thread30::split_selected_rows                     9           73.1806     3.77164     14.394      8.13118     0.00119049  
thread30::auc                                     316         54.7942     0.055328    1.31401     0.173399    0.000891379 
thread30::cos_sim                                 142         44.862      0.146846    1.00868     0.315929    0.000729804 
thread30::cast                                    151         37.3791     0.01633     4.96162     0.247543    0.000608074 
thread30::recv                                    129         37.0795     0.018008    0.941024    0.287438    0.000603201 
thread30::elementwise_mul_grad                    147         27.4848     0.026588    1.25138     0.186972    0.000447117 
thread30::elementwise_sub                         309         23.8011     0.013637    0.851214    0.0770263   0.000387191 
thread30::elementwise_sub_grad                    168         18.0279     0.01479     0.901324    0.107309    0.000293273 
thread30::square_grad                             162         16.7725     0.017734    0.897742    0.103534    0.000272851 
thread30::mean                                    158         13.9405     0.008385    1.27701     0.0882308   0.000226781 
thread30::mean_grad                               147         13.6628     0.017457    0.747887    0.0929445   0.000222264 
thread30::square                                  157         13.4525     0.010839    1.16464     0.0856845   0.000218842 
thread30::fill_constant                           302         13.0953     0.005773    0.617482    0.043362    0.000213032 
thread30::elementwise_mul                         141         11.8438     0.018806    1.24299     0.0839989   0.000192673 
thread30::fill_constant_batch_size_like           179         11.0472     0.010749    0.495749    0.0617162   0.000179713 
thread30::send                                    114         7.40501     0.011387    0.772236    0.0649562   0.000120463 
thread30::split_byref                             20          1.00707     0.022252    0.099465    0.0503533   1.63827e-05 
thread30::create_double_buffer_reader             164         0.958619    0.00202     0.031131    0.00584524  1.55946e-05 
thread29::sequence_conv_grad                      1209        5134.83     1.6826      25.5124     4.24717     0.169955    
thread29::batch_norm_grad                         621         3733.77     2.20231     20.0303     6.01252     0.123582    
thread29::mul_grad                                620         3130.63     1.67314     21.9439     5.04941     0.103619    
thread29::batch_norm                              623         2836.01     1.78996     12.3924     4.55218     0.0938676   
thread29::sequence_conv                           1234        2675.64     0.798467    23.7637     2.16826     0.0885596   
thread29::mul                                     619         1788.76     0.793701    12.5842     2.88976     0.0592052   
thread29::fetch_barrier                           2           1534.81     692.723     842.083     767.403     0.0507998   
thread29::sum                                     841         1433.89     0.060923    10.6762     1.70498     0.0474596   
thread29::concat                                  1117        1063.61     0.020658    29.6565     0.9522      0.0352038   
thread29::elementwise_add                         2021        956.749     0.014785    6.68461     0.473404    0.0316669   
thread29::elementwise_add_grad                    1954        865.14      0.026153    7.45574     0.442753    0.0286348   
thread29::lookup_table                            1718        730.907     0.024758    6.03704     0.425441    0.0241919   
thread29::sequence_pool                           1235        702.823     0.216771    23.4534     0.569087    0.0232624   
thread29::concat_grad                             929         468.455     0.10278     4.35634     0.504258    0.0155052   
thread29::sequence_pool_grad                      1237        456.783     0.046017    5.8224      0.369267    0.0151188   
thread29::lookup_table_grad                       1694        454.528     0.020587    6.07156     0.268316    0.0150442   
thread29::tanh                                    1906        409.875     0.07901     0.627835    0.215045    0.0135662   
thread29::tanh_grad                               1896        313.134     0.061119    3.17885     0.165155    0.0103643   
thread29::scale                                   652         284.882     0.067297    6.39905     0.436936    0.00942916  
thread29::broadcast                               117         211.614     0.139975    7.5873      1.80866     0.00700409  
thread29::send_barrier                            3           197.263     57.7332     81.736      65.7544     0.00652911  
thread29::cos_sim_grad                            174         180.211     0.206435    10.1919     1.03569     0.0059647   
thread29::reduce                                  104         163.171     0.223928    16.0822     1.56895     0.00540071  
thread29::read                                    280         90.7388     0.048997    1.39478     0.324067    0.00300332  
thread29::auc                                     307         54.176      0.058409    0.930818    0.176469    0.00179314  
thread29::cos_sim                                 155         52.9591     0.166349    1.4586      0.341671    0.00175287  
thread29::cast                                    147         38.7476     0.012766    3.36491     0.263589    0.00128249  
thread29::recv                                    117         37.3646     0.017274    0.929309    0.319356    0.00123671  
thread29::split_selected_rows                     4           33.1586     4.21481     11.1726     8.28966     0.0010975   
thread29::elementwise_mul_grad                    170         32.8059     0.025101    2.35777     0.192976    0.00108583  
thread29::elementwise_sub                         303         27.6947     0.01656     1.03651     0.0914016   0.000916652 
thread29::elementwise_sub_grad                    170         16.7377     0.014155    0.843866    0.0984568   0.000553991 
thread29::square                                  174         15.0917     0.011029    1.36976     0.0867338   0.000499512 
thread29::mean                                    155         14.8939     0.012552    0.921539    0.09609     0.000492967 
thread29::mean_grad                               154         14.4337     0.018362    0.662406    0.0937252   0.000477733 
thread29::square_grad                             122         12.7592     0.017876    1.68386     0.104584    0.00042231  
thread29::fill_constant                           304         12.6673     0.004984    0.38295     0.0416689   0.00041927  
thread29::fill_constant_batch_size_like           188         12.5862     0.014033    0.69352     0.066948    0.000416585 
thread29::elementwise_mul                         162         10.0433     0.018114    0.40815     0.0619957   0.000332418 
thread29::send                                    103         6.63412     0.012553    0.661339    0.064409    0.000219579 
thread29::split_byref                             21          0.956443    0.020687    0.086206    0.0455449   3.16568e-05 
thread29::create_double_buffer_reader             136         0.930767    0.002173    0.040413    0.00684388  3.0807e-05  
thread28::fetch_barrier                           3           15566       731.701     14054.9     5188.67     0.352785    
thread28::sequence_conv_grad                      1217        5015.5      1.60556     15.9852     4.1212      0.11367     
thread28::batch_norm_grad                         590         3660.69     2.19591     19.9525     6.20456     0.0829651   
thread28::mul_grad                                626         3237.75     1.6841      27.1763     5.17213     0.0733798   
thread28::batch_norm                              600         2856.46     1.80548     12.9678     4.76077     0.0647383   
thread28::sequence_conv                           1214        2672.02     0.826396    23.1496     2.20101     0.0605581   
thread28::mul                                     616         1786.42     0.799326    15.0881     2.90003     0.0404871   
thread28::sum                                     929         1451.88     0.07905     7.64262     1.56285     0.0329052   
thread28::concat                                  1159        1021.03     0.019928    20.1853     0.880954    0.0231403   
thread28::elementwise_add                         1940        931.851     0.015172    10.8858     0.480336    0.0211193   
thread28::elementwise_add_grad                    1964        858.692     0.020825    8.10077     0.437216    0.0194612   
thread28::lookup_table                            1659        744.649     0.020372    6.40308     0.448854    0.0168766   
thread28::sequence_pool                           1230        702.727     0.216314    23.8177     0.571322    0.0159264   
thread28::concat_grad                             975         512.901     0.112892    3.50805     0.526052    0.0116243   
thread28::sequence_pool_grad                      1284        488.831     0.045362    4.95196     0.380709    0.0110788   
thread28::lookup_table_grad                       1721        456.076     0.025844    7.38565     0.265006    0.0103364   
thread28::tanh                                    1832        397.226     0.079166    0.663381    0.216827    0.00900266  
thread28::tanh_grad                               1806        294.878     0.069085    1.2496      0.163277    0.00668306  
thread28::scale                                   585         264.143     0.071104    5.4815      0.451526    0.00598647  
thread28::broadcast                               124         220.785     0.141802    6.03605     1.78052     0.00500382  
thread28::send_barrier                            3           210.875     48.6153     98.0241     70.2917     0.00477923  
thread28::cos_sim_grad                            155         170.659     0.186766    6.65201     1.10102     0.00386777  
thread28::reduce                                  101         114.383     0.233775    16.9968     1.1325      0.00259235  
thread28::read                                    340         84.0425     0.034425    1.61124     0.247184    0.00190472  
thread28::auc                                     340         65.2949     0.05537     1.41675     0.192044    0.00147983  
thread28::split_selected_rows                     7           54.5682     4.1844      12.0312     7.79546     0.00123672  
thread28::cos_sim                                 147         45.8704     0.145236    1.10472     0.312044    0.0010396   
thread28::cast                                    155         41.7897     0.0144      2.37473     0.269611    0.000947112 
thread28::recv                                    122         31.3298     0.011324    0.960444    0.256802    0.000710053 
thread28::elementwise_mul_grad                    171         24.7812     0.027175    0.847965    0.144919    0.000561637 
thread28::elementwise_sub                         265         22.6835     0.016036    1.26494     0.085598    0.000514093 
thread28::fill_constant                           337         15.7978     0.005704    0.606248    0.0468779   0.000358039 
thread28::mean_grad                               148         15.1091     0.018017    0.775741    0.102089    0.00034243  
thread28::square                                  167         13.814      0.012455    0.944013    0.0827185   0.000313077 
thread28::elementwise_sub_grad                    137         13.6352     0.013938    0.897899    0.0995268   0.000309025 
thread28::square_grad                             157         13.2844     0.019777    0.737475    0.0846137   0.000301074 
thread28::mean                                    153         12.5067     0.009462    0.940145    0.081743    0.000283449 
thread28::fill_constant_batch_size_like           181         12.4944     0.013174    0.539264    0.0690298   0.00028317  
thread28::elementwise_mul                         167         10.7113     0.017727    0.492118    0.0641397   0.000242759 
thread28::send                                    113         7.10326     0.013561    0.433759    0.0628607   0.000160987 
thread28::create_double_buffer_reader             172         1.2063      0.002109    0.049891    0.0070134   2.73394e-05 
thread28::split_byref                             17          0.774264    0.029472    0.092702    0.0455449   1.75478e-05 
thread27::fetch_barrier                           4           29406.5     1209.77     14143.4     7351.64     0.506261    
thread27::sequence_conv_grad                      1237        5006.36     1.64577     20.7867     4.04718     0.0861892   
thread27::batch_norm_grad                         585         3409.13     2.19779     17.9873     5.82757     0.0586913   
thread27::mul_grad                                630         3278.65     1.64017     24.3022     5.20421     0.0564451   
thread27::batch_norm                              622         2839.97     1.86759     12.6277     4.56587     0.0488927   
thread27::sequence_conv                           1279        2684.07     0.889042    24.6597     2.09857     0.0462088   
thread27::mul                                     577         1748.47     0.796849    15.6979     3.03027     0.0301015   
thread27::sum                                     933         1480.77     0.078054    8.60146     1.58711     0.0254928   
thread27::concat                                  1128        1063.8      0.020099    27.484      0.943087    0.0183143   
thread27::elementwise_add                         2026        960.644     0.014543    6.94819     0.474158    0.0165384   
thread27::elementwise_add_grad                    2089        889.251     0.022686    9.43554     0.425683    0.0153093   
thread27::lookup_table                            1765        735.363     0.021839    6.3001      0.416636    0.01266     
thread27::sequence_pool                           1265        707.816     0.216244    23.5843     0.559538    0.0121857   
thread27::concat_grad                             1040        527.55      0.115001    4.24172     0.50726     0.00908226  
thread27::sequence_pool_grad                      1252        507.849     0.052333    7.04819     0.40563     0.00874309  
thread27::lookup_table_grad                       1768        463.918     0.025629    10.0056     0.262397    0.00798679  
thread27::tanh                                    1901        409.946     0.082556    0.659873    0.215648    0.00705761  
thread27::tanh_grad                               1935        312.516     0.060412    0.568534    0.161507    0.00538025  
thread27::scale                                   632         285.53      0.064897    7.33192     0.451788    0.00491566  
thread27::send_barrier                            3           233.117     48.4784     121.599     77.7055     0.00401332  
thread27::reduce                                  133         220.836     0.191205    18.0618     1.66042     0.0038019   
thread27::broadcast                               119         205.311     0.158503    6.19072     1.7253      0.00353462  
thread27::cos_sim_grad                            176         176.81      0.19704     5.54927     1.0046      0.00304396  
thread27::split_selected_rows                     10          96.8279     4.65727     17.6932     9.68279     0.00166698  
thread27::read                                    336         82.9406     0.044074    1.47636     0.246847    0.0014279   
thread27::auc                                     328         62.9201     0.053253    1.19445     0.19183     0.00108323  
thread27::cos_sim                                 159         49.7489     0.144835    0.888927    0.312886    0.000856474 
thread27::cast                                    166         41.6675     0.011168    2.08828     0.251009    0.000717344 
thread27::recv                                    108         30.8038     0.017092    0.898941    0.285221    0.000530317 
thread27::elementwise_mul_grad                    157         27.7483     0.027431    1.05382     0.176741    0.000477713 
thread27::elementwise_sub                         304         27.1147     0.013748    1.30397     0.0891932   0.000466805 
thread27::square_grad                             147         13.9542     0.016845    1.04025     0.0949267   0.000240235 
thread27::mean                                    155         13.8892     0.00874     0.846924    0.0896079   0.000239116 
thread27::fill_constant_batch_size_like           169         13.8769     0.012518    1.19909     0.0821115   0.000238903 
thread27::mean_grad                               142         13.8445     0.021263    1.05775     0.0974968   0.000238347 
thread27::elementwise_mul                         174         13.5003     0.01664     0.801072    0.077588    0.000232421 
thread27::fill_constant                           313         13.1475     0.005031    0.666766    0.0420046   0.000226346 
thread27::square                                  138         11.8336     0.011678    1.29488     0.0857509   0.000203727 
thread27::elementwise_sub_grad                    157         10.8754     0.014995    0.511468    0.0692704   0.000187231 
thread27::send                                    109         5.05945     0.011137    0.288446    0.046417    8.71032e-05 
thread27::create_double_buffer_reader             150         0.973883    0.002039    0.034502    0.00649255  1.67663e-05 
thread27::split_byref                             20          0.784766    0.022937    0.07276     0.0392383   1.35105e-05 
thread26::fetch_barrier                           3           14768.1     202.291     13821.9     4922.7      0.340468    
thread26::sequence_conv_grad                      1362        4957.06     1.54895     20.3399     3.63955     0.114281    
thread26::batch_norm_grad                         665         3425.33     2.18786     20.4966     5.15088     0.0789686   
thread26::mul_grad                                701         3407.95     1.64849     27.0761     4.86155     0.0785678   
thread26::batch_norm                              695         2748.69     1.79443     11.9709     3.95495     0.0633691   
thread26::sequence_conv                           1309        2642.26     0.821308    25.3928     2.01853     0.0609154   
thread26::mul                                     673         1834.41     0.79512     15.8526     2.72571     0.0422909   
thread26::sum                                     1025        1486.65     0.070291    7.94914     1.45039     0.0342736   
thread26::concat                                  1174        1097.52     0.01792     28.6442     0.934856    0.0253025   
thread26::elementwise_add                         2176        1003.65     0.014859    10.1676     0.461237    0.0231385   
thread26::elementwise_add_grad                    2254        931.946     0.025983    9.30406     0.413463    0.0214853   
thread26::lookup_table                            1722        743.238     0.023703    6.13125     0.431613    0.0171348   
thread26::sequence_pool                           1297        707.328     0.212425    23.4386     0.545357    0.0163069   
thread26::sequence_pool_grad                      1376        491.403     0.056898    5.86648     0.357124    0.0113289   
thread26::concat_grad                             1029        482.712     0.109688    4.06371     0.469108    0.0111286   
thread26::lookup_table_grad                       1802        479.426     0.019269    6.62234     0.266052    0.0110528   
thread26::tanh                                    1928        393.242     0.076417    0.664953    0.203964    0.00906591  
thread26::tanh_grad                               2050        328.153     0.064962    1.3954      0.160075    0.00756534  
thread26::scale                                   612         269.072     0.067595    4.6713      0.43966     0.00620325  
thread26::broadcast                               114         213.036     0.258859    5.72949     1.86874     0.0049114   
thread26::cos_sim_grad                            152         161.094     0.19905     6.60044     1.05983     0.00371391  
thread26::send_barrier                            3           148.796     32.8186     79.1782     49.5988     0.00343039  
thread26::reduce                                  109         132.541     0.209732    12.9687     1.21597     0.00305564  
thread26::read                                    338         90.842      0.036786    2.42634     0.268763    0.0020943   
thread26::split_selected_rows                     9           86.2314     4.07072     14.9966     9.58127     0.001988    
thread26::auc                                     310         56.8369     0.061882    1.27636     0.183345    0.00131033  
thread26::cos_sim                                 154         45.3135     0.157009    1.02111     0.294244    0.00104467  
thread26::cast                                    153         36.0897     0.013114    2.66384     0.235881    0.000832023 
thread26::recv                                    113         33.5429     0.01243     1.06444     0.29684     0.000773307 
thread26::elementwise_mul_grad                    172         29.9557     0.029852    1.39627     0.174161    0.000690608 
thread26::elementwise_sub                         328         26.8002     0.016472    0.735818    0.0817081   0.00061786  
thread26::square_grad                             170         16.933      0.017313    0.775393    0.0996059   0.000390378 
thread26::mean_grad                               170         15.5973     0.018425    1.54992     0.091749    0.000359585 
thread26::fill_constant                           334         14.475      0.006275    0.502458    0.0433382   0.00033371  
thread26::mean                                    160         13.8796     0.006344    1.08231     0.0867477   0.000319985 
thread26::square                                  149         12.0788     0.012138    1.00908     0.0810654   0.000278467 
thread26::fill_constant_batch_size_like           164         12.0125     0.01159     0.885307    0.0732472   0.00027694  
thread26::elementwise_sub_grad                    153         11.609      0.012805    0.717539    0.0758755   0.000267636 
thread26::elementwise_mul                         153         10.3314     0.014741    0.463774    0.0675253   0.000238182 
thread26::send                                    118         7.77325     0.009705    0.333087    0.065875    0.000179207 
thread26::create_double_buffer_reader             159         1.21812     0.001667    0.067506    0.00766111  2.80828e-05 
thread26::split_byref                             18          0.787693    0.026162    0.119938    0.0437607   1.81597e-05 
thread25::fetch_barrier                           5           30933.8     437.574     14413       6186.75     0.515156    
thread25::sequence_conv_grad                      1219        4853.94     1.60333     19.3605     3.9819      0.0808351   
thread25::batch_norm_grad                         675         3689.35     2.17369     19.1139     5.4657      0.0614406   
thread25::mul_grad                                651         3301.39     1.64546     34.7968     5.07127     0.0549798   
thread25::batch_norm                              669         2826.82     1.78971     14.6573     4.22545     0.0470765   
thread25::sequence_conv                           1324        2687.1      0.838679    23.8879     2.02953     0.0447496   
thread25::mul                                     635         1752.46     0.793336    18.6076     2.75978     0.0291846   
thread25::sum                                     934         1413.03     0.076512    6.8735      1.51288     0.023532    
thread25::concat                                  1101        1078.8      0.020892    28.0689     0.979832    0.0179657   
thread25::elementwise_add                         2064        960.163     0.014645    6.06984     0.465195    0.0159901   
thread25::elementwise_add_grad                    2219        916.183     0.022633    7.32887     0.412881    0.0152577   
thread25::lookup_table                            1737        728.186     0.023995    5.53981     0.419221    0.0121269   
thread25::sequence_pool                           1287        689.526     0.207364    3.43817     0.535763    0.011483    
thread25::send_barrier                            6           684.496     25.769      321.357     114.083     0.0113993   
thread25::lookup_table_grad                       1833        484.109     0.021082    7.39796     0.264108    0.00806212  
thread25::sequence_pool_grad                      1280        480.344     0.057818    5.83773     0.375269    0.00799941  
thread25::concat_grad                             945         452.635     0.108362    3.0855      0.478978    0.00753795  
thread25::tanh                                    1923        404.569     0.074186    0.595578    0.210385    0.0067375   
thread25::tanh_grad                               1931        312.59      0.063312    0.932865    0.16188     0.00520573  
thread25::scale                                   693         300.676     0.07009     5.41297     0.433876    0.00500732  
thread25::broadcast                               121         217.561     0.186888    4.23608     1.79803     0.00362316  
thread25::reduce                                  143         199.914     0.184808    17.0769     1.398       0.00332927  
thread25::cos_sim_grad                            158         158.217     0.201364    7.96774     1.00137     0.00263486  
thread25::read                                    340         84.908      0.035631    2.84462     0.249729    0.00141402  
thread25::split_selected_rows                     7           67.0861     2.74846     19.1169     9.58374     0.00111722  
thread25::auc                                     340         60.2585     0.045785    1.32191     0.177231    0.00100351  
thread25::cos_sim                                 154         50.2302     0.155414    1.36856     0.32617     0.000836508 
thread25::cast                                    176         44.0181     0.013997    4.08449     0.250103    0.000733056 
thread25::recv                                    118         32.212      0.014443    0.989908    0.272983    0.000536442 
thread25::elementwise_mul_grad                    173         28.0014     0.027063    1.3378      0.161858    0.000466322 
thread25::elementwise_sub                         299         23.7887     0.016141    0.633676    0.0795607   0.000396165 
thread25::square_grad                             163         16.3627     0.020534    0.973175    0.100385    0.000272496 
thread25::mean                                    158         15.9343     0.010323    1.56872     0.10085     0.000265362 
thread25::elementwise_sub_grad                    155         15.7587     0.017256    0.958733    0.101669    0.000262437 
thread25::mean_grad                               184         15.5612     0.018803    0.948857    0.0845719   0.000259149 
thread25::square                                  153         15.5063     0.011852    0.933417    0.101349    0.000258235 
thread25::fill_constant                           314         14.7903     0.005365    0.442019    0.047103    0.000246311 
thread25::fill_constant_batch_size_like           159         13.124      0.0106      0.82471     0.0825408   0.000218561 
thread25::send                                    136         12.5483     0.008305    4.58592     0.0922669   0.000208973 
thread25::elementwise_mul                         131         9.44994     0.020559    0.610193    0.0721369   0.000157375 
thread25::create_double_buffer_reader             181         1.17207     0.002304    0.042728    0.00647554  1.95191e-05 
thread25::split_byref                             17          0.875646    0.021916    0.113188    0.0515086   1.45826e-05 
thread24::fetch_barrier                           3           31510.2     680.83      17143       10503.4     0.523696    
thread24::sequence_conv_grad                      1246        4966.83     1.63136     16.3844     3.98622     0.0825481   
thread24::batch_norm_grad                         629         3910.18     2.19468     19.4211     6.2165      0.0649867   
thread24::mul_grad                                633         3127.02     1.68169     34.822      4.94001     0.0519707   
thread24::batch_norm                              615         2831.03     1.80115     10.5274     4.60331     0.0470514   
thread24::sequence_conv                           1255        2626        0.856488    25.6383     2.09243     0.0436437   
thread24::mul                                     610         1825.41     0.794643    13.4341     2.99247     0.0303381   
thread24::sum                                     1001        1517.77     0.080132    7.43092     1.51625     0.0252251   
thread24::concat                                  1108        1043.85     0.019347    36.0599     0.942104    0.0173487   
thread24::elementwise_add                         2113        958.119     0.017199    5.10241     0.45344     0.0159238   
thread24::elementwise_add_grad                    1983        793.357     0.022405    7.8507      0.400079    0.0131855   
thread24::lookup_table                            1746        751.488     0.020903    6.00349     0.430405    0.0124896   
thread24::sequence_pool                           1252        715.856     0.211429    23.5023     0.57177     0.0118974   
thread24::concat_grad                             921         485.583     0.114623    3.30242     0.527235    0.00807033  
thread24::lookup_table_grad                       1667        445.377     0.023916    6.35675     0.267173    0.00740211  
thread24::sequence_pool_grad                      1267        434.856     0.051095    4.9469      0.343217    0.00722724  
thread24::tanh                                    1861        398.152     0.083029    0.743202    0.213945    0.00661724  
thread24::tanh_grad                               1869        307.026     0.053188    0.592716    0.164273    0.00510273  
thread24::scale                                   611         263.829     0.067502    5.44807     0.431799    0.0043848   
thread24::reduce                                  124         243.534     0.217919    19.5786     1.96398     0.0040475   
thread24::broadcast                               119         214.641     0.14022     4.72021     1.80371     0.00356731  
thread24::send_barrier                            4           160.447     23.7894     64.2686     40.1118     0.00266661  
thread24::cos_sim_grad                            152         143.63      0.21049     3.95445     0.944931    0.0023871   
thread24::read                                    346         79.4223     0.040844    0.995213    0.229544    0.00131999  
thread24::auc                                     328         61.083      0.066111    0.735284    0.186229    0.00101519  
thread24::cos_sim                                 168         55.938      0.145622    1.25635     0.332964    0.000929682 
thread24::cast                                    136         42.5291     0.014048    2.47532     0.312714    0.000706828 
thread24::split_selected_rows                     4           41.8476     7.75345     12.2965     10.4619     0.000695502 
thread24::recv                                    121         35.3829     0.014853    0.978213    0.29242     0.000588059 
thread24::elementwise_mul_grad                    167         28.7346     0.027943    1.36036     0.172064    0.000477566 
thread24::elementwise_sub                         332         26.3158     0.013406    0.79711     0.0792644   0.000437365 
thread24::square_grad                             182         17.808      0.020503    0.901623    0.0978461   0.000295967 
thread24::mean_grad                               161         16.0417     0.019502    0.691252    0.0996378   0.000266611 
thread24::fill_constant                           303         15.3339     0.00569     0.865451    0.0506069   0.000254847 
thread24::elementwise_sub_grad                    169         14.4496     0.015178    0.846031    0.0855006   0.00024015  
thread24::elementwise_mul                         173         13.6795     0.016235    0.588444    0.0790721   0.000227351 
thread24::mean                                    145         13.2935     0.009052    0.863991    0.0916796   0.000220937 
thread24::square                                  155         11.774      0.012286    1.13366     0.0759611   0.000195682 
thread24::fill_constant_batch_size_like           171         11.3734     0.010112    0.852956    0.0665113   0.000189025 
thread24::send                                    129         8.22006     0.010999    0.623758    0.0637214   0.000136616 
thread24::create_double_buffer_reader             129         0.841683    0.002289    0.04158     0.00652467  1.39887e-05 
thread24::split_byref                             13          0.64641     0.030271    0.074567    0.0497238   1.07432e-05 
thread23::fetch_barrier                           1           15748.4     15748.4     15748.4     15748.4     0.356557    
thread23::sequence_conv_grad                      1251        5067.39     1.64766     22.9841     4.05067     0.11473     
thread23::batch_norm_grad                         645         3827.23     2.17905     19.4489     5.93369     0.0866517   
thread23::mul_grad                                628         3278.65     1.6527      28.3492     5.22077     0.0742313   
thread23::batch_norm                              614         2827.99     1.78196     17.7321     4.60585     0.0640281   
thread23::sequence_conv                           1230        2609.79     0.856069    23.7345     2.12178     0.059088    
thread23::mul                                     639         1839.6      0.795393    17.9971     2.87887     0.0416501   
thread23::sum                                     907         1505.37     0.071164    11.3352     1.65973     0.0340829   
thread23::concat                                  1076        976.14      0.017187    14.036      0.907193    0.0221006   
thread23::elementwise_add                         1996        960.15      0.016011    7.33187     0.481037    0.0217386   
thread23::elementwise_add_grad                    1959        828.429     0.022723    6.98961     0.422883    0.0187563   
thread23::lookup_table                            1755        779.488     0.021101    6.19306     0.444153    0.0176483   
thread23::sequence_pool                           1238        704.49      0.218145    23.4939     0.569055    0.0159503   
thread23::concat_grad                             838         439.396     0.117301    6.18133     0.524338    0.00994829  
thread23::sequence_pool_grad                      1149        420.364     0.05757     5.03714     0.365852    0.0095174   
thread23::lookup_table_grad                       1442        414.798     0.021351    7.8668      0.287655    0.00939138  
thread23::tanh                                    1841        398.78      0.0789      0.992387    0.21661     0.00902871  
thread23::tanh_grad                               1771        278.998     0.060921    0.450839    0.157537    0.00631675  
thread23::scale                                   640         268.167     0.0704      5.97407     0.419011    0.00607152  
thread23::broadcast                               115         209.938     0.164276    4.04969     1.82555     0.00475317  
thread23::cos_sim_grad                            143         144.262     0.206992    5.82457     1.00882     0.0032662   
thread23::reduce                                  105         138.416     0.220743    21.0082     1.31825     0.00313386  
thread23::read                                    356         86.7979     0.039844    1.18606     0.243814    0.00196518  
thread23::auc                                     325         54.3156     0.05991     1.13214     0.167125    0.00122975  
thread23::cos_sim                                 154         51.9684     0.146442    1.84068     0.337457    0.00117661  
thread23::split_selected_rows                     8           47.6175     2.93959     12.3649     5.95219     0.0010781   
thread23::cast                                    157         40.3018     0.014366    2.39133     0.256699    0.000912466 
thread23::recv                                    123         34.2138     0.01745     0.960046    0.278161    0.000774629 
thread23::elementwise_mul_grad                    158         25.9892     0.032066    1.05476     0.164489    0.000588417 
thread23::elementwise_sub                         294         24.0235     0.015316    0.93897     0.0817127   0.000543913 
thread23::square_grad                             158         21.4463     0.017573    1.49274     0.135736    0.000485562 
thread23::fill_constant                           321         15.8355     0.006114    0.511964    0.0493319   0.00035853  
thread23::send_barrier                            1           15.6305     15.6305     15.6305     15.6305     0.000353889 
thread23::mean_grad                               157         15.0769     0.015782    0.743207    0.0960311   0.000341353 
thread23::elementwise_sub_grad                    141         13.7607     0.016335    0.962749    0.097594    0.000311555 
thread23::square                                  159         13.1298     0.012472    1.11909     0.0825775   0.00029727  
thread23::mean                                    168         12.4649     0.00742     0.848475    0.074196    0.000282217 
thread23::fill_constant_batch_size_like           157         11.4516     0.014066    0.547144    0.0729399   0.000259273 
thread23::elementwise_mul                         146         9.83182     0.016029    0.473233    0.0673412   0.000222601 
thread23::send                                    104         5.96769     0.015402    0.343579    0.0573816   0.000135113 
thread23::create_double_buffer_reader             148         1.1261      0.001996    0.064732    0.00760875  2.54957e-05 
thread23::split_byref                             17          0.774611    0.023288    0.095191    0.0455654   1.75378e-05 
thread22::fetch_barrier                           2           14786.6     743.354     14043.2     7393.28     0.341354    
thread22::sequence_conv_grad                      1229        4973.41     1.61787     27.6037     4.04671     0.114813    
thread22::batch_norm_grad                         632         3738.82     2.19715     18.192      5.91585     0.0863121   
thread22::mul_grad                                622         3092.59     1.65701     29.8767     4.97202     0.0713937   
thread22::batch_norm                              617         2865.72     1.76883     13.0133     4.6446      0.0661562   
thread22::sequence_conv                           1243        2689.35     0.903958    23.5036     2.1636      0.0620847   
thread22::mul                                     587         1802.24     0.796739    19.3855     3.07026     0.0416054   
thread22::sum                                     961         1533.37     0.071646    8.6024      1.5956      0.0353984   
thread22::concat                                  1153        1055.52     0.020121    27.4968     0.91546     0.0243672   
thread22::elementwise_add                         1983        914.51      0.015121    4.96554     0.461175    0.0211118   
thread22::elementwise_add_grad                    2013        866.808     0.025512    6.06976     0.430605    0.0200106   
thread22::lookup_table                            1607        719.103     0.021146    5.92111     0.447482    0.0166008   
thread22::sequence_pool                           1263        710.016     0.213486    23.5061     0.562167    0.016391    
thread22::sequence_pool_grad                      1223        509.589     0.054652    7.62606     0.416671    0.0117641   
thread22::concat_grad                             980         503.94      0.111721    5.14636     0.514224    0.0116336   
thread22::lookup_table_grad                       1744        458.777     0.024259    6.88037     0.26306     0.0105911   
thread22::tanh                                    1956        420.797     0.084793    0.670876    0.215131    0.00971425  
thread22::tanh_grad                               1901        309.951     0.060685    2.49512     0.163046    0.00715533  
thread22::scale                                   630         258.517     0.063831    7.11118     0.410345    0.00596798  
thread22::broadcast                               116         202.658     0.161211    7.95325     1.74705     0.00467843  
thread22::reduce                                  101         168.375     0.213942    24.9698     1.66708     0.003887    
thread22::cos_sim_grad                            149         166.679     0.195302    9.74924     1.11865     0.00384784  
thread22::read                                    346         91.5511     0.037408    1.66717     0.264599    0.00211349  
thread22::split_selected_rows                     9           80.9572     3.46815     23.3136     8.99525     0.00186893  
thread22::auc                                     307         55.4321     0.05891     1.19911     0.180561    0.00127967  
thread22::cos_sim                                 161         51.3252     0.151488    0.92215     0.31879     0.00118486  
thread22::cast                                    148         41.123      0.012856    2.13414     0.277858    0.000949341 
thread22::send_barrier                            1           41.0655     41.0655     41.0655     41.0655     0.000948013 
thread22::recv                                    113         34.3274     0.016812    0.964399    0.303783    0.000792462 
thread22::elementwise_mul_grad                    179         26.7179     0.029269    0.920049    0.149262    0.000616792 
thread22::elementwise_sub                         295         24.7606     0.014389    0.790467    0.0839343   0.000571609 
thread22::mean_grad                               167         16.9989     0.017961    1.05374     0.10179     0.000392427 
thread22::mean                                    170         15.9548     0.007407    1.45864     0.0938516   0.000368322 
thread22::elementwise_sub_grad                    154         15.848      0.017855    1.11728     0.102909    0.000365856 
thread22::square                                  160         15.244      0.011063    0.993673    0.095275    0.000351914 
thread22::fill_constant                           326         13.3108     0.005138    0.411298    0.0408307   0.000307285 
thread22::square_grad                             138         12.8418     0.019366    0.502885    0.0930566   0.000296458 
thread22::fill_constant_batch_size_like           155         12.5914     0.011875    0.757514    0.0812346   0.000290677 
thread22::elementwise_mul                         144         10.336      0.01674     0.472379    0.071778    0.000238611 
thread22::send                                    121         8.31588     0.013572    0.486433    0.0687263   0.000191975 
thread22::create_double_buffer_reader             148         0.945978    0.002003    0.032536    0.00639174  2.18383e-05 
thread22::split_byref                             11          0.494659    0.027059    0.077313    0.044969    1.14194e-05 
thread21::fetch_barrier                           3           43097.6     13512       15105.3     14365.9     0.600079    
thread21::sequence_conv_grad                      1301        5043.71     1.61192     26.0295     3.8768      0.0702271   
thread21::batch_norm_grad                         638         3593.27     2.19122     19.0793     5.63208     0.0500316   
thread21::mul_grad                                645         3324.28     1.62898     23.5786     5.15392     0.0462863   
thread21::batch_norm                              645         2762.26     1.77527     11.2422     4.28258     0.0384609   
thread21::sequence_conv                           1272        2679.1      0.886793    24.0392     2.10621     0.037303    
thread21::mul                                     634         1755.83     0.792558    14.2177     2.76945     0.0244477   
thread21::sum                                     897         1365.44     0.075036    9.90257     1.52223     0.0190119   
thread21::concat                                  1161        1121.15     0.02052     33.9789     0.965675    0.0156105   
thread21::elementwise_add                         2072        1037.15     0.015222    6.70529     0.500553    0.0144409   
thread21::elementwise_add_grad                    2042        874.969     0.020921    9.61363     0.428486    0.0121828   
thread21::lookup_table                            1710        737.942     0.020939    5.27826     0.431545    0.0102749   
thread21::sequence_pool                           1233        698.425     0.220104    23.4903     0.566444    0.00972466  
thread21::lookup_table_grad                       1626        489.865     0.023096    7.62985     0.30127     0.00682074  
thread21::sequence_pool_grad                      1158        479.28      0.050558    4.58579     0.413886    0.00667335  
thread21::concat_grad                             908         473.634     0.108708    4.10501     0.521624    0.00659474  
thread21::tanh                                    1969        405.569     0.082105    0.620371    0.205977    0.00564702  
thread21::scale                                   667         309.897     0.057911    7.42818     0.464613    0.00431491  
thread21::tanh_grad                               1794        282.393     0.064327    0.755469    0.15741     0.00393195  
thread21::send_barrier                            5           228.128     17.1887     78.7203     45.6256     0.00317639  
thread21::broadcast                               114         199.341     0.239162    4.67631     1.74861     0.00277557  
thread21::reduce                                  117         181.01      0.205287    16.8949     1.54709     0.00252033  
thread21::cos_sim_grad                            146         141.273     0.182995    6.5603      0.967627    0.00196705  
thread21::split_selected_rows                     14          101.755     2.71536     12.9298     7.26823     0.00141681  
thread21::read                                    308         82.4812     0.045187    1.18173     0.267796    0.00114844  
thread21::auc                                     319         55.9328     0.050707    0.856863    0.175338    0.000778792 
thread21::cos_sim                                 138         45.0327     0.153096    2.27312     0.326324    0.000627022 
thread21::cast                                    151         39.6235     0.014561    2.24225     0.262407    0.000551705 
thread21::recv                                    115         33.5653     0.019093    0.909094    0.291872    0.000467353 
thread21::elementwise_mul_grad                    175         32.4755     0.029163    1.19356     0.185574    0.000452179 
thread21::elementwise_sub                         316         25.6867     0.012984    0.934494    0.0812869   0.000357653 
thread21::square_grad                             167         18.8069     0.018373    0.888387    0.112616    0.000261861 
thread21::mean_grad                               154         16.1777     0.017966    0.786458    0.10505     0.000225253 
thread21::elementwise_sub_grad                    149         14.1615     0.016446    0.821828    0.0950437   0.000197181 
thread21::square                                  160         13.7496     0.011698    1.00944     0.0859352   0.000191446 
thread21::fill_constant                           301         13.5626     0.004613    0.837845    0.0450585   0.000188842 
thread21::fill_constant_batch_size_like           164         12.9193     0.012481    0.617164    0.0787764   0.000179885 
thread21::mean                                    159         12.0693     0.007209    1.21087     0.0759074   0.000168049 
thread21::elementwise_mul                         164         11.5726     0.015035    0.578837    0.0705647   0.000161134 
thread21::send                                    113         6.95984     0.010235    0.314117    0.0615915   9.69067e-05 
thread21::create_double_buffer_reader             161         1.14071     0.001832    0.035779    0.00708519  1.5883e-05  
thread21::split_byref                             14          0.750289    0.025825    0.091782    0.0535921   1.04468e-05 
thread20::fetch_barrier                           2           32484       15390.8     17093.3     16242       0.531606    
thread20::sequence_conv_grad                      1195        4946.23     1.65047     16.3378     4.1391      0.0809458   
thread20::batch_norm_grad                         592         3913.02     2.19879     22.7582     6.60982     0.0640371   
thread20::mul_grad                                567         3027.3      1.67098     22.5027     5.33916     0.0495423   
thread20::batch_norm                              591         2945.95     1.80455     13.4534     4.98468     0.0482109   
thread20::sequence_conv                           1187        2711.58     0.823018    24.5853     2.2844      0.0443754   
thread20::mul                                     590         1835.91     0.793663    14.3754     3.11171     0.0300449   
thread20::sum                                     1018        1678.85     0.08516     11.4555     1.64917     0.0274747   
thread20::concat                                  1122        959.818     0.017109    14.0732     0.855452    0.0157076   
thread20::elementwise_add                         1837        917.837     0.015855    7.09625     0.499639    0.0150205   
thread20::elementwise_add_grad                    2000        825.078     0.024269    10.5409     0.412539    0.0135025   
thread20::lookup_table                            1777        718.606     0.021023    5.97393     0.404393    0.0117601   
thread20::sequence_pool                           1236        715.487     0.221941    23.672      0.578873    0.0117091   
thread20::sequence_pool_grad                      1209        468.712     0.061316    6.35119     0.387685    0.00767054  
thread20::concat_grad                             856         466.591     0.112875    3.67769     0.545083    0.00763584  
thread20::lookup_table_grad                       1576        446.736     0.026273    6.63615     0.283462    0.00731091  
thread20::tanh                                    1682        379.625     0.075994    0.76396     0.225699    0.00621263  
thread20::tanh_grad                               1799        296.848     0.062923    0.868369    0.165007    0.00485796  
thread20::broadcast                               118         212.588     0.148947    7.46339     1.80159     0.00347903  
thread20::scale                                   491         200.677     0.069581    6.6348      0.408712    0.00328412  
thread20::cos_sim_grad                            149         148.464     0.222818    13.8599     0.996402    0.00242963  
thread20::send_barrier                            4           148.398     25.6223     52.7368     37.0994     0.00242855  
thread20::reduce                                  124         145.19      0.164745    16.593      1.17088     0.00237605  
thread20::read                                    276         91.3446     0.037494    1.23626     0.330959    0.00149487  
thread20::split_selected_rows                     7           65.8375     2.82463     16.2701     9.40536     0.00107744  
thread20::auc                                     313         56.4912     0.051144    1.07221     0.180483    0.000924487 
thread20::cast                                    159         49.1462     0.015337    3.85186     0.309096    0.000804285 
thread20::cos_sim                                 144         47.2949     0.154546    0.936602    0.328437    0.000773988 
thread20::recv                                    114         37.0336     0.010092    0.9768      0.324857    0.000606061 
thread20::elementwise_mul_grad                    158         27.6752     0.030488    2.36265     0.175159    0.000452909 
thread20::elementwise_sub                         317         27.1023     0.014944    1.09668     0.0854962   0.000443533 
thread20::square_grad                             172         15.5784     0.017796    0.945176    0.0905722   0.000254943 
thread20::mean                                    150         14.6372     0.008604    1.17129     0.0975816   0.000239541 
thread20::fill_constant                           302         13.7302     0.005231    0.567153    0.0454641   0.000224696 
thread20::elementwise_sub_grad                    163         13.2582     0.014845    0.654758    0.0813386   0.000216972 
thread20::elementwise_mul                         156         13.0262     0.014858    0.785309    0.0835014   0.000213176 
thread20::fill_constant_batch_size_like           149         11.6114     0.011643    1.14918     0.0779291   0.000190023 
thread20::mean_grad                               135         11.5551     0.017356    0.722703    0.0855936   0.000189102 
thread20::square                                  128         8.34403     0.010894    0.486216    0.0651878   0.000136551 
thread20::send                                    108         6.80019     0.011454    0.350981    0.0629647   0.000111286 
thread20::create_double_buffer_reader             136         0.828601    0.002411    0.029964    0.00609265  1.35602e-05 
thread20::split_byref                             14          0.63609     0.028396    0.070724    0.045435    1.04097e-05 
thread19::fetch_barrier                           4           29029.1     694.832     14407.1     7257.28     0.502734    
thread19::sequence_conv_grad                      1212        5016.36     1.63789     21.0147     4.13891     0.0868746   
thread19::batch_norm_grad                         595         3631.33     2.20523     26.3294     6.10308     0.0628884   
thread19::mul_grad                                631         3265.16     1.66204     22.8498     5.17458     0.0565469   
thread19::batch_norm                              597         2827.81     1.777       15.6547     4.73669     0.0489727   
thread19::sequence_conv                           1212        2666.31     0.850494    23.9727     2.19993     0.0461759   
thread19::mul                                     609         1810.62     0.800135    14.9569     2.97311     0.0313568   
thread19::sum                                     953         1477.14     0.066221    10.074      1.54999     0.0255815   
thread19::concat                                  1054        993.09      0.02061     19.5554     0.94221     0.0171986   
thread19::elementwise_add                         2055        960.38      0.016906    7.3865      0.467338    0.0166321   
thread19::elementwise_add_grad                    2027        880.131     0.02392     8.99847     0.434204    0.0152423   
thread19::lookup_table                            1771        723.131     0.020133    4.70301     0.408318    0.0125234   
thread19::sequence_pool                           1226        710.665     0.214332    23.9766     0.579662    0.0123075   
thread19::sequence_pool_grad                      1262        473.038     0.064279    8.04285     0.374832    0.0081922   
thread19::concat_grad                             920         463.55      0.111541    4.14919     0.503859    0.00802788  
thread19::lookup_table_grad                       1669        416.251     0.025078    5.71805     0.249401    0.00720875  
thread19::tanh                                    1838        409.06      0.082525    0.662879    0.222557    0.00708421  
thread19::tanh_grad                               1885        314.329     0.058754    2.51797     0.166753    0.00544363  
thread19::scale                                   652         286.677     0.07041     6.85159     0.439689    0.00496475  
thread19::reduce                                  125         266.04      0.19716     19.4594     2.12832     0.00460736  
thread19::send_barrier                            4           227.197     45.9536     69.7229     56.7991     0.00393465  
thread19::broadcast                               112         223.76      0.135387    7.8345      1.99786     0.00387514  
thread19::cos_sim_grad                            147         174.189     0.206784    9.95114     1.18496     0.00301665  
thread19::read                                    274         88.7967     0.036668    1.21082     0.324076    0.00153781  
thread19::auc                                     309         53.7931     0.05167     0.848008    0.174088    0.000931603 
thread19::split_selected_rows                     9           52.4788     2.69046     11.1881     5.83098     0.000908842 
thread19::cos_sim                                 157         50.3278     0.15827     1.54877     0.32056     0.000871591 
thread19::cast                                    159         45.5306     0.014469    5.92833     0.286356    0.000788512 
thread19::recv                                    117         32.552      0.016257    0.95485     0.278223    0.000563745 
thread19::elementwise_mul_grad                    165         32.4768     0.029171    2.08277     0.196829    0.000562441 
thread19::elementwise_sub                         301         23.701      0.012919    1.00067     0.0787408   0.00041046  
thread19::square_grad                             149         16.2087     0.019526    0.949605    0.108783    0.000280706 
thread19::mean_grad                               150         15.1723     0.01785     0.872939    0.101149    0.000262758 
thread19::fill_constant                           317         15.0497     0.005363    0.551595    0.0474754   0.000260635 
thread19::elementwise_sub_grad                    147         14.1757     0.014315    1.0625      0.0964332   0.000245498 
thread19::elementwise_mul                         154         14.101      0.017836    1.15315     0.0915652   0.000244206 
thread19::square                                  154         11.1589     0.012304    0.754724    0.0724602   0.000193252 
thread19::mean                                    140         10.383      0.008666    0.597313    0.0741645   0.000179816 
thread19::fill_constant_batch_size_like           128         9.6085      0.011273    0.94739     0.0750664   0.000166403 
thread19::send                                    117         9.60774     0.014536    2.54194     0.0821174   0.000166389 
thread19::create_double_buffer_reader             181         1.08848     0.001847    0.03621     0.00601373  1.88507e-05 
thread19::split_byref                             19          0.943578    0.029156    0.089259    0.049662    1.63411e-05 
thread18::sequence_conv_grad                      1283        4889.44     1.59985     22.6447     3.81094     0.167788    
thread18::batch_norm_grad                         656         3486.26     2.18306     19.9326     5.31442     0.119636    
thread18::mul_grad                                701         3364.22     1.64713     24.1475     4.79917     0.115448    
thread18::batch_norm                              664         2753.1      1.77211     11.4267     4.14624     0.0944768   
thread18::sequence_conv                           1335        2680.15     0.849999    24.0253     2.0076      0.0919733   
thread18::mul                                     647         1786.17     0.797252    19.5658     2.76069     0.0612949   
thread18::sum                                     997         1521.71     0.072455    9.45782     1.52629     0.0522198   
thread18::concat                                  1171        1090.52     0.022406    27.5154     0.931272    0.0374228   
thread18::elementwise_add                         2136        973.014     0.016123    6.74549     0.455531    0.0333904   
thread18::elementwise_add_grad                    2087        906.736     0.022122    7.79538     0.434469    0.031116    
thread18::lookup_table                            1750        734.438     0.022908    5.98871     0.419679    0.0252033   
thread18::sequence_pool                           1273        694.163     0.217895    23.6397     0.545297    0.0238212   
thread18::sequence_pool_grad                      1293        505.837     0.058987    6.22779     0.391212    0.0173585   
thread18::concat_grad                             968         487.321     0.11239     5.53998     0.503431    0.0167231   
thread18::lookup_table_grad                       1798        477.427     0.021234    5.94184     0.265532    0.0163836   
thread18::fetch_barrier                           1           435.134     435.134     435.134     435.134     0.0149323   
thread18::tanh                                    1984        409.984     0.077096    0.668421    0.206645    0.0140692   
thread18::tanh_grad                               1888        302.635     0.066295    1.40691     0.160294    0.0103854   
thread18::scale                                   695         292.926     0.072012    3.98964     0.421476    0.0100522   
thread18::send_barrier                            4           253.891     43.5421     98.0797     63.4728     0.00871265  
thread18::broadcast                               121         213.104     0.16573     5.15323     1.76119     0.00731296  
thread18::reduce                                  108         210.507     0.206602    23.0571     1.94914     0.00722385  
thread18::cos_sim_grad                            163         169.96      0.218728    7.2493      1.0427      0.00583244  
thread18::read                                    340         84.8192     0.040729    2.33291     0.249468    0.00291069  
thread18::split_selected_rows                     10          76.0063     4.43824     10.6106     7.60063     0.00260827  
thread18::auc                                     308         56.8706     0.057807    1.36348     0.184645    0.0019516   
thread18::cos_sim                                 161         50.5872     0.150065    1.50996     0.314206    0.00173597  
thread18::cast                                    161         35.768      0.015962    1.82048     0.222161    0.00122743  
thread18::recv                                    111         32.7464     0.016446    1.02713     0.295012    0.00112374  
thread18::elementwise_mul_grad                    157         27.1668     0.028093    1.34278     0.173037    0.000932269 
thread18::elementwise_sub                         310         26.3509     0.014819    0.887087    0.0850029   0.000904269 
thread18::elementwise_sub_grad                    175         15.0321     0.015779    1.14287     0.0858978   0.000515849 
thread18::mean_grad                               164         14.399      0.01902     0.811885    0.0877989   0.000494123 
thread18::mean                                    178         13.6458     0.006926    0.772702    0.0766616   0.000468274 
thread18::square                                  178         13.3084     0.010695    0.77262     0.0747663   0.000456697 
thread18::square_grad                             150         12.8707     0.019083    1.04247     0.0858048   0.000441678 
thread18::fill_constant                           291         12.3099     0.005418    0.807463    0.0423019   0.000422431 
thread18::fill_constant_batch_size_like           138         11.1599     0.011623    0.490275    0.0808685   0.000382967 
thread18::elementwise_mul                         142         10.5333     0.018157    0.820505    0.0741784   0.000361466 
thread18::send                                    120         6.31916     0.012475    0.304842    0.0526597   0.000216851 
thread18::create_double_buffer_reader             160         1.07992     0.001885    0.048253    0.00674953  3.70592e-05 
thread18::split_byref                             23          0.920456    0.024903    0.089584    0.0400198   3.15868e-05 
thread17::fetch_barrier                           1           13792.4     13792.4     13792.4     13792.4     0.323741    
thread17::sequence_conv_grad                      1259        5132.17     1.62182     27.9581     4.07639     0.120464    
thread17::batch_norm_grad                         619         3809.58     2.2067      18.8413     6.15441     0.08942     
thread17::mul_grad                                589         3073.5      1.68779     25.6828     5.21817     0.0721426   
thread17::batch_norm                              614         2794.58     1.77637     12.1982     4.55143     0.0655955   
thread17::sequence_conv                           1244        2689.63     0.895704    23.1494     2.16208     0.0631321   
thread17::mul                                     637         1883.09     0.797644    17.8217     2.95618     0.0442006   
thread17::sum                                     908         1470        0.077417    8.84333     1.61894     0.0345045   
thread17::concat                                  1093        1030.72     0.018862    29.2485     0.943023    0.0241936   
thread17::elementwise_add                         1996        962.778     0.015652    8.02674     0.482354    0.0225987   
thread17::elementwise_add_grad                    1996        796.214     0.020176    7.1779      0.398905    0.0186891   
thread17::lookup_table                            1655        732.101     0.022726    6.29599     0.442357    0.0171842   
thread17::sequence_pool                           1226        721.587     0.221589    23.7239     0.58857     0.0169374   
thread17::concat_grad                             897         491.312     0.11471     4.03648     0.547728    0.0115323   
thread17::sequence_pool_grad                      1246        462.505     0.055023    6.25459     0.371192    0.0108561   
thread17::lookup_table_grad                       1595        454.61      0.022855    10.5289     0.285022    0.0106708   
thread17::tanh                                    1872        406.328     0.075647    0.855471    0.217055    0.00953749  
thread17::tanh_grad                               1926        320.522     0.064224    1.13814     0.166419    0.00752343  
thread17::send_barrier                            5           284.243     23.279      85.5019     56.8486     0.00667187  
thread17::scale                                   715         278.659     0.070483    5.46589     0.389733    0.0065408   
thread17::broadcast                               102         178.489     0.181334    4.96293     1.7499      0.00418958  
thread17::cos_sim_grad                            156         145.74      0.212049    7.024       0.934234    0.00342088  
thread17::reduce                                  96          133.023     0.22829     11.2894     1.38565     0.00312237  
thread17::split_selected_rows                     12          113.68      5.1527      25.6101     9.47333     0.00266834  
thread17::read                                    290         80.5073     0.036007    1.25773     0.277612    0.0018897   
thread17::auc                                     321         60.9137     0.060799    1.56961     0.189762    0.00142979  
thread17::cos_sim                                 167         53.228      0.152368    1.17021     0.318731    0.00124939  
thread17::cast                                    156         41.0111     0.013253    2.69204     0.262891    0.000962629 
thread17::recv                                    105         38.3102     0.021659    0.994597    0.364859    0.000899234 
thread17::elementwise_sub                         326         29.498      0.014647    1.22714     0.0904847   0.00069239  
thread17::elementwise_mul_grad                    147         25.5183     0.024613    1.30679     0.173594    0.000598977 
thread17::square_grad                             151         15.7288     0.01585     0.709834    0.104164    0.000369194 
thread17::elementwise_sub_grad                    157         15.2698     0.016281    1.02174     0.0972598   0.000358419 
thread17::fill_constant                           328         14.5891     0.005024    0.589343    0.0444788   0.000342441 
thread17::square                                  155         14.3665     0.012528    1.2739      0.0926869   0.000337216 
thread17::elementwise_mul                         178         13.9116     0.016058    0.543435    0.0781548   0.000326538 
thread17::mean                                    166         13.4263     0.010344    0.685308    0.0808811   0.000315147 
thread17::mean_grad                               142         12.4334     0.019817    0.816257    0.0875591   0.000291842 
thread17::fill_constant_batch_size_like           137         9.32519     0.014808    0.640509    0.0680671   0.000218885 
thread17::send                                    98          5.45294     0.011199    0.503958    0.0556422   0.000127994 
thread17::create_double_buffer_reader             161         1.16345     0.002039    0.083364    0.0072264   2.7309e-05  
thread17::split_byref                             19          1.09677     0.029116    0.141838    0.0577247   2.57438e-05 
thread16::sequence_conv_grad                      1234        5068.08     1.5901      21.3976     4.10703     0.172471    
thread16::batch_norm_grad                         584         3663.71     2.17496     18.2359     6.27347     0.124679    
thread16::mul_grad                                624         3264.71     1.64975     28.2693     5.23191     0.111101    
thread16::batch_norm                              624         2835.74     1.78641     11.8513     4.54445     0.0965024   
thread16::sequence_conv                           1205        2664.63     0.829259    23.6789     2.21131     0.0906793   
thread16::mul                                     581         1745.44     0.78887     15.6111     3.0042      0.0593986   
thread16::sum                                     917         1510.83     0.075098    9.22761     1.64757     0.0514146   
thread16::concat                                  1111        1039.28     0.020066    26.2739     0.935445    0.0353675   
thread16::elementwise_add                         2005        1006.38     0.016635    8.10163     0.501937    0.034248    
thread16::elementwise_add_grad                    1948        842.286     0.022934    10.0185     0.432385    0.0286636   
thread16::fetch_barrier                           1           815.554     815.554     815.554     815.554     0.027754    
thread16::lookup_table                            1790        730.096     0.020958    6.01011     0.407875    0.0248457   
thread16::sequence_pool                           1266        722.255     0.210311    23.5631     0.570501    0.0245789   
thread16::concat_grad                             940         471.088     0.117968    3.53152     0.501157    0.0160315   
thread16::sequence_pool_grad                      1247        446.075     0.063025    5.57028     0.357719    0.0151803   
thread16::lookup_table_grad                       1786        439.48      0.026277    5.27401     0.24607     0.0149559   
thread16::tanh                                    1873        412.731     0.084586    0.640052    0.220358    0.0140456   
thread16::tanh_grad                               1835        303.294     0.061538    0.632217    0.165283    0.0103213   
thread16::scale                                   581         243.337     0.066817    4.22531     0.418824    0.00828094  
thread16::broadcast                               127         219.656     0.149015    5.68518     1.72958     0.00747507  
thread16::reduce                                  113         179.939     0.204034    17.7762     1.59238     0.00612348  
thread16::cos_sim_grad                            149         145.174     0.192264    9.95585     0.974321    0.00494038  
thread16::send_barrier                            3           116.097     18.4953     70.2662     38.6989     0.00395086  
thread16::read                                    274         91.7951     0.042797    1.39986     0.335019    0.00312386  
thread16::auc                                     295         50.5177     0.056365    0.667877    0.171246    0.00171916  
thread16::cos_sim                                 153         48.8883     0.156588    1.74156     0.319531    0.00166371  
thread16::split_selected_rows                     5           46.7155     5.59107     20.7688     9.3431      0.00158977  
thread16::cast                                    159         43.7482     0.016281    2.4619      0.275146    0.00148879  
thread16::recv                                    114         36.1554     0.017134    0.996306    0.317152    0.0012304   
thread16::elementwise_mul_grad                    155         29.9651     0.029798    1.62637     0.193323    0.00101974  
thread16::elementwise_sub                         337         25.7691     0.014028    1.17152     0.0764663   0.000876944 
thread16::elementwise_sub_grad                    177         18.449      0.014027    1.02732     0.104232    0.000627834 
thread16::fill_constant_batch_size_like           161         15.0814     0.010985    0.865759    0.0936733   0.000513232 
thread16::fill_constant                           329         14.8989     0.004834    0.618648    0.0452855   0.000507022 
thread16::mean_grad                               165         14.8855     0.017274    0.721135    0.0902151   0.000506565 
thread16::square_grad                             147         14.34       0.019057    0.688626    0.097551    0.000488001 
thread16::elementwise_mul                         171         14.2461     0.019459    0.942034    0.0833104   0.000484805 
thread16::square                                  184         13.3345     0.011366    0.68369     0.0724703   0.000453785 
thread16::mean                                    159         12.1153     0.007243    0.60343     0.0761968   0.000412293 
thread16::send                                    98          7.02674     0.013411    0.38939     0.0717014   0.000239125 
thread16::create_double_buffer_reader             125         0.918931    0.002346    0.039607    0.00735145  3.12719e-05 
thread16::split_byref                             10          0.464999    0.031692    0.074876    0.0464999   1.58243e-05 
thread15::fetch_barrier                           4           28538       699.537     13611.6     7134.49     0.499558    
thread15::sequence_conv_grad                      1226        4908.07     1.67294     19.4151     4.00332     0.085916    
thread15::batch_norm_grad                         653         3880.79     2.19572     25.2117     5.94302     0.0679334   
thread15::mul_grad                                621         2980.78     1.65075     41.2276     4.79997     0.0521787   
thread15::batch_norm                              641         2724.24     1.7821      11.7414     4.24999     0.047688    
thread15::sequence_conv                           1308        2657.41     0.794802    23.4899     2.03166     0.0465181   
thread15::mul                                     680         1884.38     0.790369    17.0555     2.77115     0.0329862   
thread15::sum                                     974         1558.13     0.069512    8.59767     1.59972     0.0272751   
thread15::concat                                  1123        1103.68     0.019196    26.926      0.9828      0.0193201   
thread15::elementwise_add                         2236        978.01      0.015688    5.49389     0.437393    0.0171201   
thread15::elementwise_add_grad                    2161        904.176     0.025663    6.99672     0.418406    0.0158276   
thread15::lookup_table                            1730        734.793     0.023187    5.58529     0.424736    0.0128626   
thread15::sequence_pool                           1278        710.625     0.212032    23.5593     0.556045    0.0124395   
thread15::sequence_pool_grad                      1285        517.193     0.055214    7.18704     0.402485    0.00905349  
thread15::concat_grad                             930         478.552     0.118851    5.88573     0.514572    0.00837708  
thread15::lookup_table_grad                       1713        457.079     0.02204     4.8127      0.26683     0.00800119  
thread15::tanh                                    1981        420.033     0.082785    0.612286    0.212031    0.0073527   
thread15::tanh_grad                               1995        331.735     0.060455    0.930644    0.166283    0.00580703  
thread15::scale                                   611         238.216     0.071588    4.58754     0.389879    0.00416999  
thread15::broadcast                               114         202.53      0.262423    5.69124     1.77658     0.0035453   
thread15::reduce                                  119         199.016     0.196622    21.1522     1.6724      0.00348379  
thread15::cos_sim_grad                            142         140.976     0.203025    4.23323     0.992785    0.00246778  
thread15::read                                    292         94.9663     0.040232    2.19612     0.325227    0.00166239  
thread15::send_barrier                            1           75.7509     75.7509     75.7509     75.7509     0.00132602  
thread15::split_selected_rows                     9           58.4094     2.75547     10.1974     6.48993     0.00102246  
thread15::cos_sim                                 160         52.1398     0.148024    2.13756     0.325874    0.000912709 
thread15::auc                                     278         50.0978     0.049326    0.850952    0.180208    0.000876963 
thread15::cast                                    160         37.339      0.013909    3.56748     0.233369    0.000653621 
thread15::recv                                    111         32.4296     0.010139    0.967537    0.292158    0.000567681 
thread15::elementwise_sub                         338         28.9574     0.013848    0.846196    0.0856729   0.000506901 
thread15::elementwise_mul_grad                    152         25.6778     0.029086    1.20453     0.168933    0.000449491 
thread15::mean_grad                               154         17.7711     0.019544    0.772523    0.115397    0.000311084 
thread15::elementwise_sub_grad                    166         16.3466     0.013365    0.768063    0.0984732   0.000286147 
thread15::fill_constant                           330         15.1485     0.005467    0.436224    0.0459045   0.000265175 
thread15::square_grad                             140         15.107      0.01817     0.936365    0.107907    0.00026445  
thread15::mean                                    168         14.5903     0.009699    0.671572    0.086847    0.000255404 
thread15::square                                  167         14.574      0.01129     1.16623     0.0872696   0.000255119 
thread15::elementwise_mul                         146         10.4047     0.01697     0.487965    0.0712648   0.000182134 
thread15::fill_constant_batch_size_like           154         9.39132     0.012448    0.583613    0.0609826   0.000164395 
thread15::send                                    118         7.35312     0.010528    0.356032    0.0623145   0.000128717 
thread15::create_double_buffer_reader             162         1.05682     0.001841    0.042849    0.00652358  1.84997e-05 
thread15::split_byref                             9           0.489016    0.031891    0.096926    0.0543351   8.56025e-06 
thread14::fetch_barrier                           3           28805.8     598.42      14609.6     9601.95     0.502786    
thread14::sequence_conv_grad                      1307        5028.04     1.62059     23.1688     3.84701     0.087761    
thread14::batch_norm_grad                         655         3639.91     2.19145     20.5552     5.55712     0.0635322   
thread14::mul_grad                                618         3161.16     1.65234     23.809      5.11514     0.0551759   
thread14::batch_norm                              642         2819.97     1.78863     13.5083     4.39248     0.0492207   
thread14::sequence_conv                           1294        2656.66     0.805601    23.5749     2.05306     0.0463701   
thread14::mul                                     632         1884.29     0.791656    16.149      2.98147     0.032889    
thread14::sum                                     915         1418.96     0.073638    7.69184     1.55077     0.0247669   
thread14::concat                                  1109        970.6       0.019083    14.4647     0.875202    0.0169412   
thread14::elementwise_add                         2060        945.667     0.015598    8.56758     0.459062    0.016506    
thread14::elementwise_add_grad                    2156        902.773     0.023717    6.58863     0.418726    0.0157573   
thread14::lookup_table                            1742        739.098     0.022104    5.1861      0.424281    0.0129005   
thread14::sequence_pool                           1261        713.477     0.214482    23.6525     0.565802    0.0124533   
thread14::sequence_pool_grad                      1371        524.463     0.058202    5.60846     0.38254     0.00915414  
thread14::lookup_table_grad                       1874        476.327     0.023088    6.89149     0.254177    0.00831398  
thread14::concat_grad                             982         469.453     0.112823    4.48988     0.478058    0.00819398  
thread14::tanh                                    1853        395.85      0.077268    0.65027     0.213627    0.0069093   
thread14::tanh_grad                               1974        318.616     0.066807    3.15411     0.161406    0.00556123  
thread14::scale                                   676         260.939     0.066844    6.41976     0.386004    0.00455451  
thread14::broadcast                               112         213.799     0.150107    5.30079     1.90892     0.00373172  
thread14::reduce                                  122         169.356     0.166509    16.1039     1.38816     0.002956    
thread14::cos_sim_grad                            178         160.856     0.18307     3.8452      0.903687    0.00280764  
thread14::send_barrier                            2           122.647     27.5014     95.1461     61.3237     0.00214073  
thread14::read                                    338         88.7393     0.035838    1.29777     0.262542    0.00154888  
thread14::auc                                     345         61.961      0.056183    1.0282      0.179597    0.00108149  
thread14::cos_sim                                 152         50.8639     0.148356    1.19378     0.334631    0.000887796 
thread14::split_selected_rows                     6           46.867      3.58268     14.5785     7.81117     0.000818032 
thread14::cast                                    151         41.0037     0.016004    2.67795     0.271548    0.000715691 
thread14::recv                                    111         31.1804     0.0151      0.989183    0.280905    0.000544234 
thread14::elementwise_mul_grad                    170         26.51       0.030063    0.722098    0.155941    0.000462715 
thread14::elementwise_sub                         311         25.8702     0.014587    0.887262    0.0831838   0.000451546 
thread14::elementwise_sub_grad                    188         19.3118     0.011793    1.267       0.102722    0.000337074 
thread14::mean_grad                               147         16.4179     0.019592    0.999505    0.111687    0.000286564 
thread14::fill_constant                           326         15.3212     0.005819    0.653809    0.0469975   0.000267421 
thread14::square_grad                             149         12.4598     0.020187    0.922629    0.0836225   0.000217477 
thread14::square                                  163         12.3843     0.010943    0.952218    0.0759771   0.000216159 
thread14::fill_constant_batch_size_like           169         11.8987     0.011689    0.567329    0.0704067   0.000207684 
thread14::mean                                    157         11.5393     0.00861     0.820177    0.0734984   0.00020141  
thread14::elementwise_mul                         156         10.2313     0.016927    0.544329    0.0655852   0.00017858  
thread14::send                                    120         8.4821      0.012462    0.66202     0.0706842   0.000148049 
thread14::split_byref                             29          1.30912     0.024397    0.094548    0.0451419   2.28497e-05 
thread14::create_double_buffer_reader             154         1.29751     0.001968    0.123469    0.00842538  2.26471e-05 
thread13::fetch_barrier                           5           47342.1     678.779     15462.1     9468.41     0.624303    
thread13::sequence_conv_grad                      1279        4949.08     1.61172     21.4853     3.86949     0.0652638   
thread13::batch_norm_grad                         669         3611.42     2.19495     19.2016     5.39824     0.047624    
thread13::mul_grad                                618         3184.3      1.637       23.4244     5.15259     0.0419916   
thread13::batch_norm                              664         2733.57     1.82008     11.8729     4.11683     0.0360478   
thread13::sequence_conv                           1320        2656.05     0.838995    16.1645     2.01216     0.0350255   
thread13::mul                                     699         1828.35     0.791065    14.4808     2.61566     0.0241106   
thread13::sum                                     977         1590.04     0.086002    8.49257     1.62747     0.0209679   
thread13::elementwise_add                         2204        1045.33     0.016958    22.7691     0.474288    0.0137848   
thread13::concat                                  1072        1024.9      0.020397    27.5078     0.956067    0.0135155   
thread13::elementwise_add_grad                    2028        874.373     0.023593    5.53368     0.43115     0.0115304   
thread13::lookup_table                            1699        720.678     0.020974    4.72891     0.424178    0.00950363  
thread13::sequence_pool                           1298        710.753     0.212638    23.8921     0.547575    0.00937274  
thread13::concat_grad                             1009        510.448     0.11512     3.96439     0.505895    0.00673132  
thread13::sequence_pool_grad                      1229        508.286     0.055244    7.02819     0.413577    0.0067028   
thread13::lookup_table_grad                       1748        466.334     0.021748    8.72934     0.266782    0.00614958  
thread13::tanh                                    2000        407.252     0.074562    1.33797     0.203626    0.00537045  
thread13::tanh_grad                               1932        311.327     0.063593    2.24798     0.161142    0.00410549  
thread13::scale                                   662         281.752     0.072002    5.36245     0.425607    0.00371548  
thread13::reduce                                  109         219.894     0.220399    25.7349     2.01737     0.00289975  
thread13::broadcast                               117         208.254     0.148068    4.66118     1.77995     0.00274625  
thread13::cos_sim_grad                            156         159.775     0.1948      8.89747     1.0242      0.00210696  
thread13::read                                    336         92.1633     0.040354    2.85077     0.274296    0.00121536  
thread13::auc                                     328         67.5141     0.052528    1.66554     0.205836    0.000890313 
thread13::split_selected_rows                     6           44.8147     2.82099     16.3935     7.46912     0.000590974 
thread13::cos_sim                                 153         44.3642     0.159415    1.017       0.289962    0.000585033 
thread13::cast                                    159         36.7063     0.014936    3.11617     0.230857    0.000484048 
thread13::recv                                    119         33.1921     0.012474    0.959421    0.278925    0.000437706 
thread13::elementwise_sub                         323         27.8602     0.014594    1.19528     0.0862545   0.000367394 
thread13::elementwise_mul_grad                    164         26.3672     0.030007    1.53316     0.160775    0.000347705 
thread13::mean_grad                               175         16.8838     0.017026    0.989554    0.096479    0.000222648 
thread13::square_grad                             162         15.5832     0.019925    0.870899    0.0961927   0.000205497 
thread13::fill_constant                           313         14.4904     0.005611    0.397676    0.0462952   0.000191086 
thread13::square                                  155         14.199      0.011461    0.84051     0.0916065   0.000187243 
thread13::elementwise_sub_grad                    150         13.9458     0.013962    0.697143    0.0929723   0.000183905 
thread13::mean                                    167         12.3102     0.00904     0.747296    0.0737135   0.000162335 
thread13::fill_constant_batch_size_like           159         10.7238     0.013344    0.719343    0.0674453   0.000141416 
thread13::elementwise_mul                         138         10.1446     0.017084    1.0272      0.0735115   0.000133777 
thread13::send                                    87          4.28645     0.010395    0.258075    0.0492695   5.65256e-05 
thread13::create_double_buffer_reader             174         1.16594     0.001253    0.048149    0.00670082  1.53754e-05 
thread13::split_byref                             19          0.938318    0.030328    0.092657    0.0493852   1.23737e-05 
thread12::fetch_barrier                           2           16885.4     635.451     16249.9     8442.68     0.37073     
thread12::sequence_conv_grad                      1289        5043.84     1.62472     17.5516     3.91299     0.110741    
thread12::batch_norm_grad                         631         3625.86     2.19348     18.6965     5.74621     0.0796084   
thread12::mul_grad                                621         3200.81     1.67376     26.0158     5.15429     0.0702762   
thread12::batch_norm                              647         2692.55     1.77723     13.5355     4.16158     0.0591168   
thread12::sequence_conv                           1331        2635.84     0.853006    24.9275     1.98035     0.0578719   
thread12::mul                                     659         1826.61     0.792017    14.4572     2.77179     0.0401046   
thread12::sum                                     876         1471.05     0.072411    11.9747     1.67928     0.0322979   
thread12::concat                                  1149        1087.79     0.020338    27.0509     0.946724    0.0238831   
thread12::elementwise_add                         2155        1018.06     0.014548    6.62965     0.472416    0.0223522   
thread12::elementwise_add_grad                    2035        868.054     0.024279    6.99659     0.426562    0.0190588   
thread12::lookup_table                            1743        760.167     0.022344    6.1278      0.436125    0.01669     
thread12::sequence_pool                           1261        707.588     0.218216    23.7041     0.561132    0.0155356   
thread12::sequence_pool_grad                      1287        486.82      0.061934    6.36729     0.378259    0.0106885   
thread12::concat_grad                             920         483.243     0.104707    4.34066     0.525264    0.0106099   
thread12::lookup_table_grad                       1681        451.957     0.02202     8.59856     0.268862    0.00992304  
thread12::tanh                                    1974        413.332     0.072926    0.641173    0.209388    0.00907501  
thread12::scale                                   643         323.11      0.070393    7.86351     0.502504    0.00709412  
thread12::tanh_grad                               1873        298.325     0.061725    1.63261     0.159277    0.00654994  
thread12::broadcast                               114         209.346     0.141323    5.51092     1.83637     0.00459635  
thread12::send_barrier                            3           202.201     36.2935     103.97      67.4004     0.00443947  
thread12::reduce                                  109         197.503     0.179334    19.7022     1.81196     0.00433632  
thread12::cos_sim_grad                            161         168.809     0.195283    7.32996     1.0485      0.00370632  
thread12::read                                    292         87.9797     0.036032    1.71051     0.3013      0.00193166  
thread12::auc                                     309         53.3957     0.046257    1.29749     0.172802    0.00117234  
thread12::split_selected_rows                     6           48.0049     2.75862     11.8476     8.00081     0.00105398  
thread12::cast                                    163         47.7612     0.014649    3.60712     0.293014    0.00104863  
thread12::cos_sim                                 154         46.6308     0.156555    1.12942     0.302798    0.00102381  
thread12::recv                                    112         35.0294     0.013687    1.02095     0.312762    0.000769096 
thread12::elementwise_mul_grad                    159         26.4996     0.030099    1.02092     0.166664    0.000581818 
thread12::elementwise_sub                         317         25.4204     0.012162    0.973159    0.0801906   0.000558124 
thread12::elementwise_sub_grad                    172         15.3915     0.013948    1.04276     0.0894856   0.000337932 
thread12::square_grad                             168         15.196      0.020816    0.89669     0.0904524   0.000333639 
thread12::elementwise_mul                         200         13.8375     0.01605     0.75418     0.0691876   0.000303813 
thread12::mean                                    148         13.4655     0.009911    0.852742    0.0909833   0.000295646 
thread12::square                                  153         13.1179     0.01156     0.657972    0.085738    0.000288013 
thread12::fill_constant                           303         12.8033     0.004943    0.412392    0.042255    0.000281105 
thread12::mean_grad                               146         12.1727     0.01971     0.490179    0.083375    0.000267262 
thread12::fill_constant_batch_size_like           140         10.6978     0.011406    0.456522    0.0764132   0.000234879 
thread12::send                                    124         8.40225     0.011551    0.461244    0.0677601   0.000184477 
thread12::create_double_buffer_reader             163         1.15944     0.002091    0.078007    0.00711312  2.54563e-05 
thread12::split_byref                             17          1.02015     0.025098    0.107562    0.0600089   2.23982e-05 
thread11::fetch_barrier                           4           45568.5     809.808     16084.1     11392.1     0.615       
thread11::sequence_conv_grad                      1249        4859.05     1.62629     23.4927     3.89035     0.0655786   
thread11::mul_grad                                676         3544.63     1.65947     29.132      5.24354     0.047839    
thread11::batch_norm_grad                         601         3429.46     2.19645     21.8815     5.70626     0.0462846   
thread11::batch_norm                              605         2733.86     1.79092     14.2844     4.51877     0.0368966   
thread11::sequence_conv                           1250        2634.88     0.820235    25.1457     2.1079      0.0355608   
thread11::mul                                     652         1895.02     0.790145    19.4952     2.90647     0.0255755   
thread11::sum                                     973         1487.5      0.067386    7.35649     1.52877     0.0200755   
thread11::concat                                  1130        1112.87     0.019077    44.3501     0.984837    0.0150194   
thread11::elementwise_add                         2012        965.653     0.017221    8.25071     0.479947    0.0130326   
thread11::elementwise_add_grad                    2096        873.467     0.023211    8.03383     0.41673     0.0117885   
thread11::lookup_table                            1711        752.819     0.022836    5.98346     0.439988    0.0101602   
thread11::sequence_pool                           1260        711.894     0.218566    23.7061     0.564995    0.00960784  
thread11::lookup_table_grad                       1761        484.998     0.021475    5.76734     0.275411    0.00654562  
thread11::sequence_pool_grad                      1251        480.758     0.055072    5.17188     0.384299    0.0064884   
thread11::concat_grad                             947         469.298     0.104643    3.91665     0.495562    0.00633372  
thread11::tanh                                    1819        388.44      0.077048    0.654295    0.213546    0.00524246  
thread11::scale                                   653         303.048     0.064814    7.57688     0.464086    0.00408999  
thread11::tanh_grad                               1844        291.265     0.062156    0.667186    0.157953    0.00393096  
thread11::broadcast                               122         211.607     0.139496    5.65774     1.73448     0.00285588  
thread11::reduce                                  118         184.836     0.145929    16.8689     1.56641     0.00249458  
thread11::cos_sim_grad                            177         179.699     0.192542    6.60584     1.01525     0.00242525  
thread11::split_selected_rows                     12          99.3336     2.99079     16.9102     8.2778      0.00134062  
thread11::read                                    338         81.4341     0.039677    1.4443      0.240929    0.00109905  
thread11::auc                                     325         56.8271     0.045439    1.29111     0.174852    0.000766948 
thread11::cos_sim                                 186         56.6724     0.152497    1.18111     0.30469     0.00076486  
thread11::cast                                    156         37.3727     0.011853    2.73271     0.239568    0.000504388 
thread11::recv                                    115         31.7564     0.01835     0.979048    0.276143    0.00042859  
thread11::elementwise_sub                         327         27.364      0.015852    0.983325    0.0836819   0.000369309 
thread11::elementwise_mul_grad                    130         19.6901     0.028662    0.759218    0.151462    0.000265741 
thread11::mean_grad                               183         17.8197     0.019616    1.06529     0.0973755   0.000240498 
thread11::square_grad                             159         15.6391     0.018209    1.05876     0.0983593   0.000211068 
thread11::fill_constant_batch_size_like           168         14.6429     0.012591    0.982678    0.0871603   0.000197624 
thread11::mean                                    161         14.2395     0.00812     0.813723    0.0884443   0.000192179 
thread11::fill_constant                           307         14.0419     0.006318    0.442073    0.0457391   0.000189512 
thread11::elementwise_sub_grad                    145         12.7879     0.016061    1.30061     0.0881922   0.000172587 
thread11::square                                  152         10.964      0.011616    1.29577     0.0721316   0.000147972 
thread11::elementwise_mul                         129         10.8097     0.017406    0.873579    0.083796    0.000145889 
thread11::send                                    132         8.37255     0.010064    0.886904    0.0634284   0.000112997 
thread11::create_double_buffer_reader             156         1.11368     0.002302    0.076595    0.00713897  1.50304e-05 
thread11::split_byref                             16          0.668356    0.018892    0.075453    0.0417723   9.02025e-06 
thread10::fetch_barrier                           3           27905.3     609.927     13970.1     9301.77     0.495614    
thread10::sequence_conv_grad                      1242        5001.15     1.65689     21.3562     4.02669     0.0888233   
thread10::batch_norm_grad                         628         3805.69     2.19921     19.6471     6.06002     0.0675912   
thread10::mul_grad                                578         3159.19     1.66225     25.5194     5.46573     0.056109    
thread10::batch_norm                              606         2858.39     1.81015     14.8521     4.71681     0.0507666   
thread10::sequence_conv                           1239        2675.24     0.879054    25.6072     2.15919     0.0475137   
thread10::mul                                     592         1756.27     0.793066    15.5724     2.96667     0.0311923   
thread10::sum                                     951         1405.63     0.075799    8.31579     1.47805     0.0249648   
thread10::concat                                  1147        1049.77     0.018676    26.1        0.915233    0.0186445   
thread10::elementwise_add                         1940        956.783     0.015292    9.07805     0.493187    0.016993    
thread10::elementwise_add_grad                    1961        803.534     0.021306    8.71669     0.409757    0.0142712   
thread10::lookup_table                            1729        736.748     0.020991    6.00964     0.426112    0.0130851   
thread10::sequence_pool                           1247        712.559     0.215601    23.461      0.571418    0.0126554   
thread10::sequence_pool_grad                      1280        506.368     0.056144    5.84544     0.3956      0.00899337  
thread10::concat_grad                             942         492.118     0.120531    4.16522     0.522418    0.00874028  
thread10::lookup_table_grad                       1752        475.525     0.02363     6.50643     0.271418    0.00844559  
thread10::tanh                                    1804        387.756     0.075507    0.61062     0.214942    0.00688676  
thread10::tanh_grad                               1836        299.045     0.055996    1.64046     0.162878    0.0053112   
thread10::scale                                   638         276.89      0.058454    5.68637     0.433997    0.00491773  
thread10::broadcast                               112         209.949     0.17165     6.21955     1.87454     0.00372881  
thread10::reduce                                  123         179.454     0.211193    15.9261     1.45897     0.00318719  
thread10::cos_sim_grad                            151         158.795     0.206391    8.27171     1.05162     0.00282028  
thread10::read                                    308         84.8472     0.038914    1.55864     0.275478    0.00150693  
thread10::auc                                     335         60.318      0.056311    0.955638    0.180054    0.00107128  
thread10::cos_sim                                 151         51.4833     0.148379    1.88589     0.340949    0.000914373 
thread10::split_selected_rows                     6           47.7025     3.3271      12.2915     7.95042     0.000847223 
thread10::cast                                    149         38.5146     0.016053    2.91507     0.258488    0.000684042 
thread10::recv                                    120         33.007      0.013064    0.938893    0.275058    0.000586223 
thread10::elementwise_sub                         328         29.2496     0.015662    1.02153     0.0891758   0.00051949  
thread10::elementwise_mul_grad                    140         28.4628     0.027844    1.46832     0.203306    0.000505515 
thread10::elementwise_sub_grad                    159         16.1476     0.016939    1.02455     0.101558    0.000286791 
thread10::mean                                    149         15.4905     0.011255    1.31164     0.103963    0.000275119 
thread10::square_grad                             151         15.0416     0.015333    0.576152    0.0996136   0.000267148 
thread10::fill_constant                           335         14.3254     0.005004    0.397451    0.0427624   0.000254427 
thread10::fill_constant_batch_size_like           154         13.6294     0.010946    1.24553     0.0885026   0.000242066 
thread10::mean_grad                               156         13.5895     0.020444    0.974843    0.0871119   0.000241356 
thread10::elementwise_mul                         180         12.4656     0.017426    0.382112    0.0692535   0.000221396 
thread10::square                                  135         9.55361     0.011617    0.557194    0.0707674   0.000169677 
thread10::send                                    123         6.68475     0.012929    0.316239    0.0543476   0.000118725 
thread10::create_double_buffer_reader             152         1.0503      0.002139    0.069017    0.00690989  1.8654e-05  
thread10::split_byref                             16          0.811218    0.026237    0.129718    0.0507011   1.44077e-05 
thread9::fetch_barrier                            4           45394       705.143     15411.9     11348.5     0.613797    
thread9::sequence_conv_grad                       1277        5035.46     1.65335     14.9289     3.94319     0.0680871   
thread9::batch_norm_grad                          630         3645.24     2.1917      17.6404     5.7861      0.0492893   
thread9::mul_grad                                 643         3262.63     1.68102     24.4812     5.07407     0.0441157   
thread9::batch_norm                               638         2770.15     1.77411     13.8411     4.34193     0.0374567   
thread9::sequence_conv                            1265        2659.35     0.858854    24.8404     2.10226     0.0359586   
thread9::mul                                      644         1855.66     0.790586    16.1129     2.88146     0.0250914   
thread9::sum                                      936         1511.9      0.075685    9.80199     1.61528     0.0204433   
thread9::concat                                   1122        1026.34     0.021173    26.8104     0.914742    0.0138777   
thread9::elementwise_add                          2056        983.16      0.014861    6.77585     0.478191    0.0132938   
thread9::elementwise_add_grad                     1977        803.739     0.020175    6.7102      0.406545    0.0108678   
thread9::lookup_table                             1756        750.176     0.024611    6.20442     0.427207    0.0101435   
thread9::sequence_pool                            1287        710.688     0.216884    23.5378     0.552205    0.0096096   
thread9::lookup_table_grad                        1770        465.588     0.023209    5.73702     0.263044    0.00629547  
thread9::sequence_pool_grad                       1250        464.589     0.05538     4.98275     0.371671    0.00628195  
thread9::concat_grad                              933         460.781     0.112279    3.77117     0.493871    0.00623047  
thread9::tanh                                     1853        397.275     0.075082    0.654249    0.214396    0.00537177  
thread9::tanh_grad                                1907        309.451     0.063233    0.582961    0.162271    0.00418426  
thread9::scale                                    646         296.676     0.065449    5.88007     0.45925     0.00401151  
thread9::broadcast                                123         203.166     0.156994    7.79936     1.65176     0.00274712  
thread9::cos_sim_grad                             160         158.428     0.199243    6.87161     0.990176    0.00214219  
thread9::reduce                                   103         156.828     0.173291    23.1737     1.5226      0.00212056  
thread9::split_selected_rows                      12          125.177     2.77551     23.8891     10.4314     0.00169259  
thread9::read                                     268         92.6759     0.044755    2.12933     0.345806    0.00125312  
thread9::cos_sim                                  175         54.7653     0.14472     0.883277    0.312944    0.000740511 
thread9::auc                                      289         50.1728     0.059346    1.11383     0.173608    0.000678414 
thread9::send_barrier                             2           49.9712     19.3172     30.654      24.9856     0.000675688 
thread9::cast                                     175         42.9469     0.01203     2.63408     0.245411    0.000580708 
thread9::recv                                     116         41.113      0.018141    0.967945    0.354423    0.000555912 
thread9::elementwise_mul_grad                     152         25.8188     0.028932    1.04823     0.169861    0.00034911  
thread9::elementwise_sub                          298         24.9781     0.013497    1.57411     0.0838192   0.000337743 
thread9::elementwise_sub_grad                     175         21.7677     0.015277    1.01042     0.124387    0.000294332 
thread9::square_grad                              164         16.6153     0.016859    0.830941    0.101313    0.000224665 
thread9::fill_constant                            335         15.9051     0.00612     0.720436    0.047478    0.000215062 
thread9::mean_grad                                142         14.9088     0.019454    0.770203    0.104991    0.00020159  
thread9::mean                                     143         13.9365     0.009414    0.672979    0.0974583   0.000188443 
thread9::elementwise_mul                          165         13.0013     0.018435    0.727815    0.078796    0.000175798 
thread9::square                                   142         12.9369     0.008229    0.664625    0.0911051   0.000174927 
thread9::fill_constant_batch_size_like            150         8.35739     0.011463    0.333245    0.055716    0.000113005 
thread9::send                                     127         8.26406     0.01099     0.401593    0.0650714   0.000111743 
thread9::create_double_buffer_reader              146         0.940069    0.001758    0.051403    0.00643883  1.27112e-05 
thread9::split_byref                              13          0.506866    0.026818    0.057698    0.0389897   6.85361e-06 
thread8::sequence_conv_grad                       1301        5205.06     1.58755     18.2666     4.00081     0.176824    
thread8::batch_norm_grad                          626         3725.31     2.18648     22.9447     5.95097     0.126554    
thread8::mul_grad                                 605         3073.84     1.64432     26.3762     5.08072     0.104423    
thread8::batch_norm                               626         2840.88     1.78994     15.6265     4.53814     0.0965088   
thread8::sequence_conv                            1207        2652.61     0.772446    23.959      2.19768     0.090113    
thread8::mul                                      640         1767.07     0.796368    16.1711     2.76104     0.0600299   
thread8::sum                                      905         1471.25     0.079453    9.65826     1.62569     0.0499806   
thread8::concat                                   1091        1144.54     0.016723    31.7004     1.04907     0.0388817   
thread8::elementwise_add                          1964        943.317     0.015429    7.22925     0.480304    0.0320459   
thread8::elementwise_add_grad                     2016        877.401     0.023014    7.81769     0.435219    0.0298066   
thread8::lookup_table                             1766        767.377     0.024078    6.37176     0.434528    0.0260689   
thread8::fetch_barrier                            1           739.621     739.621     739.621     739.621     0.025126    
thread8::sequence_pool                            1274        719.277     0.214302    23.5274     0.564582    0.0244349   
thread8::concat_grad                              917         483.764     0.125003    4.60696     0.527551    0.0164342   
thread8::sequence_pool_grad                       1251        473.204     0.061339    5.74253     0.37826     0.0160754   
thread8::lookup_table_grad                        1619        425.47      0.022791    5.93199     0.262798    0.0144538   
thread8::tanh                                     1811        391.202     0.078513    0.682612    0.216014    0.0132897   
thread8::tanh_grad                                1806        304.026     0.059536    2.72275     0.168342    0.0103282   
thread8::scale                                    569         248.02      0.072389    4.04737     0.435888    0.00842563  
thread8::send_barrier                             4           239.001     21.2392     117.19      59.7503     0.00811923  
thread8::broadcast                                112         206.607     0.171346    7.94773     1.8447      0.00701874  
thread8::cos_sim_grad                             155         173.788     0.20167     7.10267     1.12121     0.00590383  
thread8::reduce                                   98          85.2461     0.182759    10.1377     0.869858    0.00289594  
thread8::read                                     324         83.3764     0.045863    1.36234     0.257335    0.00283242  
thread8::auc                                      324         58.7427     0.045017    1.17139     0.181305    0.00199558  
thread8::split_selected_rows                      8           50.7198     3.2612      9.63971     6.33997     0.00172303  
thread8::cos_sim                                  153         47.9601     0.143878    1.33116     0.313465    0.00162928  
thread8::cast                                     156         37.1558     0.015545    2.38322     0.238178    0.00126224  
thread8::recv                                     116         32.2342     0.016892    0.898582    0.277881    0.00109504  
thread8::elementwise_sub                          333         29.7448     0.014535    1.12614     0.0893236   0.00101047  
thread8::elementwise_mul_grad                     146         25.9007     0.030991    1.69174     0.177402    0.000879886 
thread8::mean_grad                                166         17.4795     0.016978    0.882141    0.105298    0.000593806 
thread8::square_grad                              176         15.9734     0.018324    0.849144    0.0907581   0.000542641 
thread8::fill_constant_batch_size_like            162         13.0578     0.01306     0.601151    0.0806039   0.000443594 
thread8::square                                   160         12.6767     0.012978    0.680612    0.0792294   0.000430647 
thread8::fill_constant                            293         12.0972     0.006192    0.461571    0.0412874   0.00041096  
thread8::elementwise_mul                          170         11.9332     0.016044    0.571521    0.0701953   0.000405389 
thread8::elementwise_sub_grad                     140         11.1762     0.015815    0.466704    0.0798301   0.000379673 
thread8::mean                                     140         10.103      0.006696    0.535193    0.072164    0.000343213 
thread8::send                                     102         6.3699      0.015137    0.475607    0.06245     0.000216395 
thread8::create_double_buffer_reader              149         1.06017     0.001769    0.035788    0.00711526  3.60157e-05 
thread8::split_byref                              18          0.819699    0.02762     0.120458    0.0455388   2.78464e-05 
thread7::fetch_barrier                            3           44096.8     13420.1     15572.2     14698.9     0.605917    
thread7::sequence_conv_grad                       1314        4888.67     1.62081     21.8524     3.72045     0.0671733   
thread7::batch_norm_grad                          671         3509.46     2.18962     18.5318     5.2302      0.0482222   
thread7::mul_grad                                 706         3425.51     1.63792     27.8536     4.85199     0.0470686   
thread7::batch_norm                               667         2718.2      1.76818     11.7029     4.07526     0.0373497   
thread7::sequence_conv                            1353        2671.31     0.826143    22.8835     1.97436     0.0367054   
thread7::mul                                      706         1758.76     0.791725    15.2376     2.49117     0.0241665   
thread7::sum                                      1057        1439.99     0.071566    9.22598     1.36234     0.0197863   
thread7::concat                                   1120        1041.75     0.019332    16.0628     0.930138    0.0143143   
thread7::elementwise_add                          2191        1036.65     0.016752    7.22537     0.47314     0.0142442   
thread7::elementwise_add_grad                     2175        876.413     0.021103    9.96765     0.402949    0.0120425   
thread7::lookup_table                             1689        715.541     0.024653    5.94807     0.423648    0.00983197  
thread7::sequence_pool                            1294        714.91      0.210386    23.4828     0.55248     0.00982329  
thread7::sequence_pool_grad                       1378        524.019     0.054286    5.59645     0.380275    0.00720034  
thread7::concat_grad                              1024        496.354     0.110917    5.25344     0.484721    0.00682021  
thread7::lookup_table_grad                        1887        471.342     0.01907     6.03824     0.249784    0.00647653  
thread7::tanh                                     2049        425.456     0.07619     0.658785    0.207641    0.00584602  
thread7::tanh_grad                                1990        314.857     0.063509    0.86146     0.15822     0.00432633  
thread7::scale                                    639         314.81      0.07197     4.44925     0.49266     0.00432568  
thread7::reduce                                   134         224.631     0.225641    18.0432     1.67635     0.00308656  
thread7::broadcast                                117         214.48      0.1547      5.36566     1.83316     0.00294708  
thread7::send_barrier                             4           176.583     26.8622     70.0072     44.1458     0.00242636  
thread7::cos_sim_grad                             156         175.374     0.176672    9.27743     1.12419     0.00240975  
thread7::split_selected_rows                      13          110.802     5.23098     13.4726     8.52321     0.00152248  
thread7::read                                     308         86.0268     0.031585    1.89389     0.279308    0.00118206  
thread7::cos_sim                                  173         55.4263     0.14366     0.903815    0.320383    0.000761591 
thread7::auc                                      309         54.5062     0.057809    1.14647     0.176395    0.000748948 
thread7::cast                                     150         37.3937     0.012154    2.47006     0.249291    0.000513812 
thread7::recv                                     109         32.7459     0.016907    0.924388    0.300421    0.000449949 
thread7::elementwise_sub                          342         29.0007     0.01445     0.946714    0.0847974   0.000398487 
thread7::elementwise_mul_grad                     144         23.4808     0.028543    0.771677    0.163061    0.00032264  
thread7::square_grad                              172         15.9491     0.017878    1.57868     0.0927271   0.00021915  
thread7::mean                                     150         14.0158     0.008844    1.43049     0.0934387   0.000192586 
thread7::mean_grad                                162         13.9955     0.021047    0.626142    0.0863921   0.000192307 
thread7::square                                   162         13.5921     0.010093    1.42613     0.0839019   0.000186764 
thread7::fill_constant                            301         13.4386     0.005487    0.59923     0.0446464   0.000184654 
thread7::fill_constant_batch_size_like            176         13.239      0.013404    1.08081     0.0752217   0.000181912 
thread7::elementwise_sub_grad                     135         12.7983     0.016871    1.00232     0.0948023   0.000175857 
thread7::elementwise_mul                          161         10.0885     0.018115    0.544687    0.0626613   0.000138622 
thread7::send                                     120         6.49772     0.012681    0.400394    0.0541477   8.92827e-05 
thread7::create_double_buffer_reader              162         1.23463     0.00179     0.052463    0.00762119  1.69646e-05 
thread7::split_byref                              21          0.874182    0.02118     0.089751    0.0416277   1.20118e-05 
thread6::sequence_conv_grad                       1274        5016.19     1.69912     21.5263     3.93735     0.175358    
thread6::batch_norm_grad                          689         3679.94     2.20029     17.6266     5.34099     0.128645    
thread6::mul_grad                                 647         3181.16     1.66386     31.3188     4.91678     0.111208    
thread6::batch_norm                               660         2805.94     1.78986     11.1254     4.25142     0.0980911   
thread6::sequence_conv                            1327        2654.02     0.790426    23.3031     2.00002     0.0927804   
thread6::mul                                      662         1825.52     0.796126    19.3243     2.75759     0.0638173   
thread6::sum                                      917         1461.63     0.070714    9.32687     1.59392     0.0510961   
thread6::concat                                   1146        1141.83     0.020326    26.8824     0.996364    0.0399166   
thread6::elementwise_add                          2136        983.769     0.018764    6.84459     0.460566    0.034391    
thread6::elementwise_add_grad                     2100        871.283     0.024368    9.12019     0.414897    0.0304586   
thread6::lookup_table                             1775        732.499     0.02663     7.51795     0.412675    0.025607    
thread6::sequence_pool                            1278        707.365     0.20992     23.5113     0.553494    0.0247283   
thread6::sequence_pool_grad                       1260        477.593     0.054759    5.43147     0.379042    0.0166959   
thread6::concat_grad                              979         475.517     0.111264    3.0443      0.485717    0.0166233   
thread6::lookup_table_grad                        1823        465.947     0.022302    11.9122     0.255594    0.0162888   
thread6::tanh                                     1957        407.597     0.079513    0.676667    0.208277    0.0142489   
thread6::tanh_grad                                1966        317.761     0.064561    1.06132     0.161628    0.0111084   
thread6::scale                                    664         275.391     0.064515    5.29579     0.414746    0.00962723  
thread6::broadcast                                107         196.822     0.296909    6.9091      1.83945     0.00688056  
thread6::cos_sim_grad                             156         159.725     0.188262    11.4127     1.02388     0.00558373  
thread6::reduce                                   118         132.494     0.176529    16.5025     1.12283     0.00463178  
thread6::split_selected_rows                      13          114.103     2.75881     13.2467     8.77716     0.00398886  
thread6::read                                     316         80.9112     0.037007    1.45663     0.256048    0.00282852  
thread6::send_barrier                             2           78.5746     23.9206     54.6539     39.2873     0.00274684  
thread6::auc                                      362         61.1244     0.05714     1.38223     0.168852    0.00213681  
thread6::cos_sim                                  158         49.5619     0.151043    1.06912     0.313683    0.0017326   
thread6::cast                                     161         42.7501     0.01531     3.45233     0.265529    0.00149447  
thread6::recv                                     110         31.2973     0.014093    0.900887    0.284521    0.0010941   
thread6::elementwise_sub                          346         30.0732     0.015333    0.800613    0.0869168   0.00105131  
thread6::elementwise_mul_grad                     157         29.1312     0.026985    1.13342     0.185549    0.00101838  
thread6::square_grad                              161         17.6438     0.018998    1.31782     0.109589    0.000616798 
thread6::elementwise_sub_grad                     154         16.6417     0.014022    2.03981     0.108063    0.000581767 
thread6::mean_grad                                145         15.1737     0.015765    1.16953     0.104646    0.000530448 
thread6::fill_constant                            297         13.9849     0.005605    0.463413    0.0470871   0.000488889 
thread6::elementwise_mul                          171         12.216      0.016608    0.611339    0.0714388   0.000427053 
thread6::square                                   168         11.4379     0.01178     0.920068    0.068083    0.000399852 
thread6::mean                                     138         10.2905     0.006592    0.765124    0.0745689   0.00035974  
thread6::fill_constant_batch_size_like            159         9.70111     0.011959    0.385411    0.0610133   0.000339135 
thread6::send                                     139         8.83712     0.011692    0.4439      0.0635764   0.000308931 
thread6::create_double_buffer_reader              176         1.23829     0.001943    0.075787    0.00703576  4.32887e-05 
thread6::split_byref                              15          0.761707    0.02446     0.082714    0.0507805   2.6628e-05  
thread5::fetch_barrier                            3           16455.8     691.426     14673.5     5485.28     0.36337     
thread5::sequence_conv_grad                       1237        4937.97     1.61842     26.7579     3.99189     0.109038    
thread5::batch_norm_grad                          590         3724.12     2.18379     18.2984     6.31208     0.0822343   
thread5::mul_grad                                 603         3267.95     1.67994     27.6365     5.41948     0.0721612   
thread5::batch_norm                               605         2895.62     1.90974     14.547      4.78615     0.0639396   
thread5::sequence_conv                            1226        2653.67     0.862593    23.6912     2.16449     0.058597    
thread5::mul                                      596         1777.84     0.801573    21.5311     2.98295     0.0392573   
thread5::sum                                      966         1472.6      0.066126    11.021      1.52443     0.0325172   
thread5::concat                                   1169        1038.06     0.020236    26.8325     0.887993    0.022922    
thread5::elementwise_add                          2001        925.52      0.011574    5.71484     0.462529    0.0204369   
thread5::elementwise_add_grad                     1937        789.523     0.022682    10.3698     0.407601    0.0174338   
thread5::lookup_table                             1713        734.526     0.026364    9.46655     0.428795    0.0162194   
thread5::sequence_pool                            1236        704.767     0.223312    23.5786     0.5702      0.0155623   
thread5::sequence_pool_grad                       1307        511.414     0.060297    7.43059     0.391289    0.0112928   
thread5::lookup_table_grad                        1775        488.029     0.024588    8.83189     0.274946    0.0107764   
thread5::concat_grad                              900         464.808     0.120609    7.52225     0.516453    0.0102637   
thread5::tanh                                     1822        400.045     0.081667    0.652092    0.219564    0.0088336   
thread5::send_barrier                             3           317.32      77.0701     140.63      105.773     0.00700691  
thread5::tanh_grad                                1892        305.28      0.066369    0.707227    0.161353    0.00674104  
thread5::scale                                    620         259.838     0.071382    4.7153      0.419093    0.00573761  
thread5::reduce                                   130         257.804     0.160515    21.9779     1.98311     0.0056927   
thread5::broadcast                                114         209.187     0.151482    4.41026     1.83497     0.00461915  
thread5::cos_sim_grad                             156         163.499     0.19912     9.24948     1.04807     0.0036103   
thread5::read                                     278         96.4237     0.043374    1.61857     0.346848    0.00212918  
thread5::split_selected_rows                      9           75.5048     3.00244     14.0049     8.38942     0.00166726  
thread5::auc                                      317         56.6127     0.058583    1.11129     0.178589    0.00125009  
thread5::cast                                     181         51.7896     0.015039    2.56399     0.28613     0.00114359  
thread5::cos_sim                                  136         44.156      0.145186    1.22603     0.324677    0.000975032 
thread5::recv                                     110         31.8099     0.017351    0.986029    0.289181    0.000702411 
thread5::elementwise_mul_grad                     141         26.6379     0.022939    2.19058     0.188922    0.000588206 
thread5::elementwise_sub                          302         25.2029     0.01587     1.05107     0.0834535   0.000556519 
thread5::mean_grad                                186         19.264      0.019253    1.35656     0.10357     0.000425379 
thread5::mean                                     169         15.5487     0.00866     2.09921     0.0920039   0.000343338 
thread5::elementwise_sub_grad                     145         15.3487     0.018084    1.47946     0.105853    0.000338922 
thread5::square_grad                              152         14.6755     0.017737    0.961119    0.0965495   0.000324058 
thread5::fill_constant                            310         13.4351     0.005678    0.443868    0.043339    0.000296667 
thread5::square                                   174         12.874      0.010655    1.05055     0.0739884   0.000284277 
thread5::fill_constant_batch_size_like            145         12.4461     0.013582    0.69394     0.0858348   0.000274828 
thread5::elementwise_mul                          146         10.2102     0.018759    0.600416    0.0699331   0.000225457 
thread5::send                                     108         7.65606     0.013508    0.403571    0.0708894   0.000169057 
thread5::create_double_buffer_reader              178         1.26606     0.002086    0.051236    0.0071127   2.79565e-05 
thread5::split_byref                              14          0.671764    0.029159    0.091992    0.0479831   1.48336e-05 
thread4::fetch_barrier                            6           45005.3     779.496     15068.6     7500.89     0.610506    
thread4::sequence_conv_grad                       1226        4915.68     1.57775     19.5195     4.00953     0.0666821   
thread4::batch_norm_grad                          627         3751.64     2.19081     18.5176     5.98348     0.0508918   
thread4::mul_grad                                 604         3135.94     1.65237     27.2786     5.19196     0.0425397   
thread4::batch_norm                               623         2896.12     1.80976     13.0359     4.64866     0.0392863   
thread4::sequence_conv                            1251        2684.37     0.856778    24.7292     2.14578     0.0364139   
thread4::mul                                      611         1757.85     0.796196    14.0086     2.87701     0.0238456   
thread4::sum                                      927         1537.3      0.081091    8.60128     1.65836     0.0208538   
thread4::concat                                   1120        1022.1      0.019827    24.784      0.912589    0.013865    
thread4::elementwise_add                          1978        925.726     0.013907    6.8075      0.468011    0.0125576   
thread4::elementwise_add_grad                     2129        857.374     0.025906    8.93441     0.402712    0.0116304   
thread4::lookup_table                             1762        735.375     0.022126    5.52929     0.417353    0.00997551  
thread4::sequence_pool                            1255        705.521     0.217575    23.5603     0.562169    0.00957053  
thread4::concat_grad                              975         487.999     0.116203    4.08285     0.500512    0.0066198   
thread4::sequence_pool_grad                       1305        481.749     0.057428    7.13429     0.369156    0.00653502  
thread4::lookup_table_grad                        1753        474.013     0.020694    9.24527     0.270401    0.00643008  
thread4::tanh                                     1914        415.569     0.074146    0.646064    0.217121    0.00563727  
thread4::tanh_grad                                1934        316.161     0.062475    0.627872    0.163475    0.00428878  
thread4::send_barrier                             4           306.483     28.3505     96.4781     76.6208     0.0041575   
thread4::scale                                    555         230.302     0.06489     6.68327     0.414959    0.00312409  
thread4::broadcast                                124         221.34      0.132456    7.9524      1.785       0.00300252  
thread4::reduce                                   112         198.636     0.203311    28.4731     1.77353     0.00269453  
thread4::cos_sim_grad                             176         166.467     0.19225     9.87159     0.945835    0.00225816  
thread4::read                                     332         86.8419     0.03549     1.43534     0.261572    0.00117803  
thread4::split_selected_rows                      7           62.3308     5.8013      12.6304     8.9044      0.000845529 
thread4::auc                                      336         61.6384     0.051676    0.847169    0.183448    0.000836137 
thread4::cos_sim                                  144         45.9037     0.150551    1.01853     0.318775    0.000622692 
thread4::cast                                     157         38.061      0.01436     2.25206     0.242427    0.000516305 
thread4::recv                                     127         32.1108     0.011583    1.07391     0.252841    0.000435589 
thread4::elementwise_mul_grad                     148         24.9885     0.028817    1.14045     0.168841    0.000338973 
thread4::elementwise_sub                          296         21.8619     0.015304    0.683005    0.0738576   0.00029656  
thread4::square_grad                              169         19.2263     0.01944     1.05134     0.113765    0.000260808 
thread4::elementwise_sub_grad                     151         15.3715     0.01584     1.16456     0.101798    0.000208518 
thread4::fill_constant                            343         14.612      0.004554    0.606437    0.0426005   0.000198214 
thread4::mean_grad                                137         13.3247     0.019492    0.649568    0.0972609   0.000180753 
thread4::mean                                     166         12.6241     0.006911    0.532463    0.076049    0.000171249 
thread4::fill_constant_batch_size_like            145         12.1787     0.011969    1.01887     0.0839912   0.000165207 
thread4::elementwise_mul                          134         8.96961     0.016795    0.718967    0.0669374   0.000121674 
thread4::square                                   122         8.92471     0.009809    1.0685      0.0731534   0.000121065 
thread4::send                                     105         8.51385     0.013831    0.669003    0.0810843   0.000115492 
thread4::create_double_buffer_reader              170         1.06918     0.001913    0.064567    0.00628926  1.45036e-05 
thread4::split_byref                              13          0.509717    0.026731    0.052322    0.039209    6.91441e-06 
thread3::fetch_barrier                            3           16126.8     1027.06     13948       5375.6      0.360091    
thread3::sequence_conv_grad                       1283        4998.19     1.63697     18.7378     3.89571     0.111603    
thread3::batch_norm_grad                          633         3737.96     2.19388     19.8902     5.90515     0.0834639   
thread3::mul_grad                                 645         3273.08     1.64824     36.1187     5.07454     0.0730837   
thread3::batch_norm                               625         2804.44     1.77273     14.3121     4.4871      0.0626196   
thread3::sequence_conv                            1258        2671.06     0.873218    23.3088     2.12326     0.0596414   
thread3::mul                                      653         1845.6      0.792872    17.2394     2.82634     0.0412099   
thread3::sum                                      881         1332.22     0.07705     11.7231     1.51217     0.0297469   
thread3::concat                                   1141        1065.44     0.019129    26.9339     0.933774    0.0237899   
thread3::elementwise_add                          2024        946.873     0.016307    11.1211     0.467823    0.0211425   
thread3::elementwise_add_grad                     2030        859.024     0.020897    10.9512     0.423165    0.0191809   
thread3::lookup_table                             1746        763.35      0.022715    8.31356     0.437199    0.0170446   
thread3::sequence_pool                            1265        707.121     0.219961    23.6759     0.558989    0.0157891   
thread3::concat_grad                              927         467.381     0.114297    4.41937     0.504186    0.010436    
thread3::sequence_pool_grad                       1258        457.863     0.047415    6.07234     0.363961    0.0102235   
thread3::lookup_table_grad                        1710        455.806     0.023423    6.99781     0.266553    0.0101776   
thread3::tanh                                     1840        394.456     0.079123    0.623015    0.214378    0.0088077   
thread3::tanh_grad                                1859        300.715     0.064753    0.619348    0.161762    0.00671458  
thread3::reduce                                   129         280.761     0.213505    23.4755     2.17644     0.00626903  
thread3::scale                                    597         266.151     0.067507    5.75145     0.445814    0.00594281  
thread3::broadcast                                112         197.45      0.149688    5.25541     1.76295     0.00440881  
thread3::cos_sim_grad                             161         176.614     0.217214    5.08321     1.09698     0.00394357  
thread3::send_barrier                             4           157.596     19.5997     63.3453     39.3989     0.00351891  
thread3::read                                     292         88.6213     0.041031    1.11845     0.303498    0.0019788   
thread3::split_selected_rows                      8           60.6079     4.30495     14.3818     7.57599     0.0013533   
thread3::auc                                      305         56.2668     0.063798    0.97987     0.184481    0.00125637  
thread3::cos_sim                                  148         51.3628     0.150766    1.20545     0.347046    0.00114687  
thread3::cast                                     131         35.8008     0.012187    4.74631     0.273289    0.000799388 
thread3::recv                                     118         34.0804     0.017682    0.951278    0.288817    0.000760972 
thread3::elementwise_sub                          315         27.5182     0.017585    0.949066    0.0873595   0.000614447 
thread3::elementwise_mul_grad                     147         25.5643     0.028053    1.36328     0.173906    0.000570818 
thread3::mean_grad                                174         18.452      0.017932    1.32225     0.106046    0.00041201  
thread3::square_grad                              174         17.2508     0.013737    0.921739    0.0991426   0.000385189 
thread3::fill_constant                            317         13.9169     0.005746    0.436266    0.043902    0.000310748 
thread3::elementwise_sub_grad                     162         13.8559     0.016102    1.17051     0.08553     0.000309384 
thread3::fill_constant_batch_size_like            145         13.8456     0.010824    0.910915    0.0954871   0.000309155 
thread3::mean                                     157         13.8029     0.007901    1.02165     0.0879166   0.000308201 
thread3::elementwise_mul                          151         10.0779     0.017571    0.507239    0.066741    0.000225027 
thread3::square                                   149         9.44395     0.011693    0.605248    0.0633822   0.000210871 
thread3::send                                     104         6.91592     0.010207    0.38335     0.0664992   0.000154424 
thread3::create_double_buffer_reader              167         1.11548     0.001697    0.03213     0.00667954  2.49073e-05 
thread3::split_byref                              21          0.873184    0.017677    0.086735    0.0415802   1.94971e-05 
thread2::sequence_conv_grad                       1233        4987.19     1.60932     21.6324     4.04476     0.173628    
thread2::batch_norm_grad                          628         3814.35     2.18203     20.0726     6.0738      0.132796    
thread2::mul_grad                                 609         3181.56     1.65522     30.5159     5.22423     0.110765    
thread2::batch_norm                               629         2844.79     1.81178     12.4858     4.52272     0.0990411   
thread2::sequence_conv                            1205        2665.86     0.845448    23.2254     2.21233     0.0928114   
thread2::mul                                      563         1838.72     0.792125    17.3844     3.26594     0.0640149   
thread2::sum                                      896         1450.49     0.070377    7.98409     1.61885     0.0504987   
thread2::concat                                   1112        1068.94     0.018571    27.9796     0.961276    0.0372149   
thread2::elementwise_add                          1959        936.189     0.01578     5.40256     0.477891    0.0325933   
thread2::elementwise_add_grad                     1991        820.345     0.019232    5.49352     0.412026    0.0285602   
thread2::lookup_table                             1773        733.559     0.020635    6.09997     0.413739    0.0255388   
thread2::sequence_pool                            1260        716.089     0.217793    23.5135     0.568325    0.0249305   
thread2::concat_grad                              901         458.076     0.107967    4.32036     0.508409    0.0159479   
thread2::lookup_table_grad                        1684        457.041     0.020155    6.92962     0.271402    0.0159118   
thread2::sequence_pool_grad                       1147        451.831     0.055839    6.4527      0.393924    0.0157304   
thread2::tanh                                     1875        402.74      0.0818      0.67296     0.214794    0.0140213   
thread2::tanh_grad                                1903        305.981     0.055397    1.0031      0.160789    0.0106527   
thread2::scale                                    614         280.607     0.067163    6.81103     0.457015    0.00976929  
thread2::send_barrier                             3           242.469     43.4689     118.924     80.8231     0.00844154  
thread2::broadcast                                111         202.038     0.175994    7.94383     1.82016     0.00703391  
thread2::cos_sim_grad                             156         178.64      0.196283    7.42524     1.14513     0.00621934  
thread2::reduce                                   125         166.18      0.160538    16.3606     1.32944     0.00578554  
thread2::read                                     292         93.7926     0.035045    1.68874     0.321207    0.00326538  
thread2::split_selected_rows                      6           61.8458     6.03627     14.0812     10.3076     0.00215315  
thread2::auc                                      323         56.8357     0.051584    1.31947     0.175962    0.00197873  
thread2::cos_sim                                  163         50.7105     0.150525    1.0188      0.311107    0.00176548  
thread2::cast                                     175         46.6754     0.014429    3.46286     0.266717    0.001625    
thread2::recv                                     114         39.3734     0.016857    0.980981    0.345381    0.00137078  
thread2::elementwise_mul_grad                     162         34.1649     0.026424    1.66564     0.210894    0.00118944  
thread2::elementwise_sub                          283         21.5788     0.012652    0.743158    0.0762502   0.000751264 
thread2::elementwise_sub_grad                     156         16.6028     0.013518    1.35983     0.106428    0.000578023 
thread2::mean_grad                                158         16.416      0.017877    1.7871      0.103899    0.000571521 
thread2::square_grad                              160         14.6364     0.017292    0.685256    0.0914774   0.000509564 
thread2::mean                                     163         13.0237     0.008131    0.850193    0.0798999   0.000453418 
thread2::square                                   165         12.8957     0.012124    1.33776     0.0781556   0.000448961 
thread2::fill_constant                            299         12.8692     0.005556    0.737359    0.0430407   0.000448039 
thread2::fill_constant_batch_size_like            149         9.78944     0.014137    0.506359    0.0657009   0.000340818 
thread2::elementwise_mul                          145         9.42746     0.016499    0.713661    0.0650169   0.000328216 
thread2::send                                     119         6.96275     0.00978     0.507824    0.0585105   0.000242407 
thread2::create_double_buffer_reader              146         1.29025     0.001719    0.108989    0.00883736  4.492e-05   
thread2::split_byref                              15          0.789689    0.024626    0.102735    0.0526459   2.74929e-05 
thread1::fetch_barrier                            2           28710.5     13523.7     15186.8     14355.2     0.500549    
thread1::sequence_conv_grad                       1298        5049.28     1.67225     18.5906     3.89005     0.088031    
thread1::batch_norm_grad                          633         3603.08     2.1929      20.4361     5.69207     0.0628174   
thread1::mul_grad                                 639         3219.43     1.665       25.368      5.03823     0.0561287   
thread1::batch_norm                               639         2752.27     1.79805     12.0923     4.30715     0.047984    
thread1::sequence_conv                            1295        2692.65     0.820025    23.2822     2.07927     0.0469447   
thread1::mul                                      667         1860.42     0.789184    16.527      2.78923     0.0324352   
thread1::sum                                      981         1439.59     0.069277    10.8553     1.46748     0.0250984   
thread1::concat                                   1123        1047.95     0.02064     29.6785     0.933172    0.0182704   
thread1::elementwise_add                          2063        995.579     0.016068    7.57196     0.482588    0.0173573   
thread1::elementwise_add_grad                     2130        899.576     0.021646    10.8468     0.422336    0.0156835   
thread1::lookup_table                             1708        735.717     0.021545    6.21887     0.430747    0.0128267   
thread1::sequence_pool                            1267        705.242     0.213195    23.5736     0.556623    0.0122954   
thread1::concat_grad                              995         494.902     0.108628    3.56371     0.497389    0.00862831  
thread1::lookup_table_grad                        1907        492.522     0.02007     6.10832     0.258271    0.00858682  
thread1::sequence_pool_grad                       1330        471.118     0.062115    4.94855     0.354224    0.00821364  
thread1::tanh                                     1946        416.831     0.07861     0.72679     0.214199    0.00726718  
thread1::scale                                    614         305.592     0.07229     6.47611     0.497707    0.00532781  
thread1::tanh_grad                                1919        303.954     0.060346    0.673147    0.158392    0.00529924  
thread1::broadcast                                111         202.156     0.157805    5.60393     1.82123     0.00352447  
thread1::reduce                                   109         194.586     0.22739     20.6498     1.78519     0.00339248  
thread1::cos_sim_grad                             153         162.757     0.205553    7.67509     1.06377     0.00283757  
thread1::send_barrier                             3           113.072     26.7542     50.7378     37.6905     0.00197133  
thread1::read                                     332         83.9012     0.039914    1.76094     0.252714    0.00146276  
thread1::split_selected_rows                      6           60.9944     7.87294     16.6097     10.1657     0.0010634   
thread1::auc                                      283         52.5627     0.049418    1.24951     0.185734    0.000916397 
thread1::cos_sim                                  157         50.2024     0.142193    1.23465     0.319761    0.000875247 
thread1::cast                                     150         33.2348     0.014635    2.93171     0.221565    0.000579427 
thread1::recv                                     110         33.1227     0.016076    0.971415    0.301115    0.000577473 
thread1::elementwise_sub                          322         25.8503     0.012545    1.05623     0.0802804   0.000450684 
thread1::elementwise_mul_grad                     147         23.4623     0.032249    0.935644    0.159607    0.00040905  
thread1::mean_grad                                172         17.8072     0.017966    0.78329     0.10353     0.000310456 
thread1::elementwise_sub_grad                     170         17.6332     0.014866    0.801       0.103724    0.000307423 
thread1::square_grad                              159         15.6442     0.019057    0.705937    0.0983912   0.000272747 
thread1::fill_constant                            300         14.8799     0.006199    0.560092    0.0495998   0.000259422 
thread1::elementwise_mul                          187         14.2968     0.014804    0.736823    0.0764535   0.000249256 
thread1::square                                   158         13.2274     0.010822    1.8082      0.0837179   0.000230612 
thread1::mean                                     156         12.0512     0.007874    0.847896    0.0772516   0.000210106 
thread1::fill_constant_batch_size_like            155         11.8357     0.012104    0.725338    0.0763593   0.000206348 
thread1::send                                     111         6.26758     0.009864    0.352335    0.0564647   0.000109271 
thread1::create_double_buffer_reader              149         1.19366     0.001408    0.053387    0.00801114  2.08107e-05 
thread1::split_byref                              23          1.05312     0.022435    0.117102    0.0457877   1.83604e-05 
thread0::ScopeBufferedSSAGraphExecutorAfterRun    90          6600.67     52.3774     115.897     73.3408     0.623317    
thread0::ThreadedSSAGraphExecutorPrepare          90          3988.92     34.4373     81.4597     44.3214     0.376683    

The profiling output from pserver.log is as follows:

------------------------->     Profiling Report     <-------------------------

Note! This Report merge all thread info into one.
Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event    Calls       Total       Min.        Max.        Ave.        Ratio.      
sum      720         14603       0.398116    218.89      20.282      0.829053    
scale    2160        100.124     0.006067    0.33895     0.0463536   0.00568429  
adam     720         2910.96     0.05964     51.2808     4.04299     0.165263    


------------------------->     Profiling Report     <-------------------------

Place: CPU
Time unit: ms
Sorted by event first end time in descending order in the same thread

Event              Calls       Total       Min.        Max.        Ave.        Ratio.      
thread56::sum      12          186.695     0.824641    149.836     15.5579     0.749089    
thread56::scale    36          1.16934     0.006696    0.137374    0.0324816   0.0046918   
thread56::adam     12          61.3652     0.245226    24.4686     5.11377     0.246219    
thread55::sum      12          483.673     0.766154    162.293     40.3061     0.976418    
thread55::scale    36          1.98347     0.006295    0.176165    0.0550963   0.00400414  
thread55::adam     12          9.69802     0.200259    3.01724     0.808168    0.0195779   
thread54::sum      12          159.376     0.948185    136.407     13.2813     0.730827    
thread54::scale    36          1.56606     0.006574    0.134693    0.0435016   0.00718125  
thread54::adam     12          57.134      0.160922    25.8742     4.76117     0.261992    
thread53::sum      12          315.124     0.89058     162.892     26.2604     0.790483    
thread53::scale    36          1.59852     0.006214    0.180902    0.0444034   0.00400986  
thread53::adam     12          81.925      0.267202    28.0194     6.82709     0.205507    
thread52::sum      12          32.6205     0.876013    9.20827     2.71838     0.273131    
thread52::scale    36          1.60248     0.00795     0.133358    0.0445134   0.0134175   
thread52::adam     12          85.2089     0.237518    27.8905     7.10074     0.713452    
thread51::sum      12          330.08      0.561279    157.69      27.5067     0.89516     
thread51::scale    36          1.40198     0.006239    0.107224    0.038944    0.0038021   
thread51::adam     12          37.2566     0.198774    24.3433     3.10472     0.101038    
thread50::sum      12          166.812     0.897113    140.615     13.901      0.922653    
thread50::scale    36          1.60449     0.00639     0.102774    0.0445692   0.00887459  
thread50::adam     12          12.3796     0.214678    5.11717     1.03164     0.0684729   
thread49::sum      12          297.091     0.795978    140.713     24.7576     0.898283    
thread49::scale    36          1.38741     0.006183    0.096483    0.0385391   0.00419496  
thread49::adam     12          32.2536     0.142105    24.1988     2.6878      0.0975219   
thread48::sum      13          156.685     0.918431    138.135     12.0527     0.597748    
thread48::scale    39          1.80445     0.006908    0.145863    0.046268    0.0068839   
thread48::adam     13          103.636     0.145078    26.4769     7.97203     0.395368    
thread47::sum      13          291.431     0.7818      142.378     22.4178     0.970983    
thread47::scale    39          2.30168     0.00631     0.187686    0.0590175   0.00766869  
thread47::adam     13          6.4075      0.204372    1.40484     0.492884    0.0213483   
thread46::sum      13          305.36      0.902332    138.543     23.4892     0.833553    
thread46::scale    39          1.55396     0.006762    0.140672    0.0398451   0.0042419   
thread46::adam     13          59.4216     0.207129    24.3573     4.57089     0.162205    
thread45::sum      13          37.9585     1.04528     8.58173     2.91989     0.498705    
thread45::scale    39          1.73666     0.008507    0.142952    0.0445298   0.0228165   
thread45::adam     13          36.419      0.226543    24.6673     2.80146     0.478479    
thread44::sum      13          201.441     0.758061    156.317     15.4955     0.827726    
thread44::scale    39          1.67924     0.006662    0.177386    0.0430575   0.00690005  
thread44::adam     13          40.2464     0.279416    24.491      3.09588     0.165374    
thread43::sum      13          456.888     0.961683    165.465     35.1453     0.926578    
thread43::scale    39          1.93931     0.0066      0.158156    0.049726    0.00393296  
thread43::adam     13          34.2646     0.174004    24.2578     2.63574     0.0694892   
thread42::sum      13          299.918     0.909198    137.527     23.0707     0.887942    
thread42::scale    39          1.90426     0.006396    0.11771     0.0488273   0.00563778  
thread42::adam     13          35.9455     0.20051     24.2599     2.76504     0.106421    
thread41::sum      13          296.412     0.622487    137.236     22.801      0.778724    
thread41::scale    39          1.67193     0.006507    0.133644    0.0428699   0.00439243  
thread41::adam     13          82.5544     0.118687    25.5742     6.35034     0.216884    
thread40::sum      13          330.701     0.913908    170.893     25.4385     0.859859    
thread40::scale    39          2.31203     0.006914    0.150784    0.0592827   0.00601152  
thread40::adam     13          51.5861     0.227056    42.2294     3.96816     0.13413     
thread39::sum      13          300.715     0.917226    140.377     23.1319     0.831133    
thread39::scale    39          1.76612     0.006513    0.211945    0.0452853   0.00488131  
thread39::adam     13          59.3324     0.142193    26.5372     4.56403     0.163986    
thread38::sum      13          153.146     0.889188    136.114     11.7805     0.730058    
thread38::scale    39          2.13414     0.006529    0.175277    0.0547216   0.0101736   
thread38::adam     13          54.4922     0.231962    25.2431     4.1917      0.259768    
thread37::sum      13          349.404     0.682413    177.63      26.8772     0.954478    
thread37::scale    39          1.73901     0.006741    0.184736    0.0445901   0.00475051  
thread37::adam     13          14.9252     0.124565    3.0241      1.1481      0.0407717   
thread36::sum      13          305.996     0.978411    146.479     23.5382     0.829802    
thread36::scale    39          2.02339     0.007391    0.132695    0.0518818   0.00548705  
thread36::adam     13          60.7384     0.127481    27.9429     4.67219     0.164711    
thread35::sum      13          494.258     0.823066    162.477     38.0198     0.885893    
thread35::scale    39          2.1029      0.006147    0.181214    0.0539206   0.00376918  
thread35::adam     13          61.5597     0.156423    26.0769     4.73536     0.110338    
thread34::sum      13          187.21      0.769729    163.458     14.4008     0.762016    
thread34::scale    39          1.7792      0.006439    0.161436    0.0456206   0.00724202  
thread34::adam     13          56.6883     0.23822     24.5173     4.36064     0.230742    
thread33::sum      13          54.5418     1.16457     8.6478      4.19552     0.436209    
thread33::scale    39          1.43874     0.007853    0.152865    0.0368907   0.0115066   
thread33::adam     13          69.0554     0.188671    27.7681     5.31195     0.552284    
thread32::sum      13          51.0084     0.999526    8.72337     3.92372     0.533508    
thread32::scale    39          1.84941     0.008416    0.156523    0.0474208   0.0193434   
thread32::adam     13          42.7517     0.289186    24.4924     3.28859     0.447149    
thread31::sum      13          315.462     1.00572     153.417     24.2663     0.823851    
thread31::scale    39          1.79913     0.006493    0.129017    0.0461317   0.00469856  
thread31::adam     13          65.6505     0.162062    33.1232     5.05004     0.171451    
thread30::sum      13          667.29      0.973816    218.89      51.33       0.949844    
thread30::scale    39          1.83992     0.006423    0.134325    0.0471775   0.00261901  
thread30::adam     13          33.3957     0.11876     25.4171     2.5689      0.0475366   
thread29::sum      13          24.8553     0.691669    8.19566     1.91195     0.303589    
thread29::scale    39          1.94978     0.0092      0.11974     0.0499943   0.023815    
thread29::adam     13          55.0665     0.095966    24.4508     4.23588     0.672596    
thread28::sum      13          47.6725     1.01185     17.2035     3.66712     0.432911    
thread28::scale    39          2.03644     0.008328    0.200611    0.0522165   0.0184928   
thread28::adam     13          60.4118     0.156938    24.3182     4.64706     0.548596    
thread27::sum      13          298.984     0.91568     137.445     22.9988     0.777921    
thread27::scale    39          1.48761     0.006456    0.127339    0.0381438   0.00387058  
thread27::adam     13          83.8658     0.289919    27.1467     6.45122     0.218209    
thread26::sum      13          308.024     0.398116    141.417     23.6942     0.953681    
thread26::scale    39          2.10961     0.006736    0.190414    0.0540924   0.0065316   
thread26::adam     13          12.8508     0.127671    2.89404     0.988524    0.0397877   
thread25::sum      13          426.326     1.01315     137.71      32.7943     0.872783    
thread25::scale    39          1.97765     0.006067    0.159565    0.050709    0.00404869  
thread25::adam     13          60.1639     0.05964     26.7858     4.628       0.123169    
thread24::sum      13          155.746     0.816433    135.003     11.9805     0.934608    
thread24::scale    39          2.76032     0.006362    0.281195    0.0707775   0.0165642   
thread24::adam     13          8.1368      0.255332    2.84823     0.625908    0.0488276   
thread23::sum      13          25.8642     0.874502    8.0217      1.98955     0.310776    
thread23::scale    39          2.35844     0.007685    0.33895     0.0604728   0.0283383   
thread23::adam     13          55.0019     0.134988    24.7021     4.23091     0.660886    
thread22::sum      13          301.314     0.880579    143.149     23.178      0.959843    
thread22::scale    39          2.66436     0.006628    0.190116    0.068317    0.00848738  
thread22::adam     13          9.9416      0.290237    3.34103     0.764738    0.0316692   
thread21::sum      13          34.9364     0.962374    8.35851     2.68742     0.295531    
thread21::scale    39          1.55142     0.008527    0.112243    0.0397799   0.0131236   
thread21::adam     13          81.7278     0.231207    24.3374     6.28675     0.691345    
thread20::sum      13          446.042     1.01206     157.307     34.3109     0.836679    
thread20::scale    39          1.90418     0.006581    0.130494    0.048825    0.00357183  
thread20::adam     13          85.164      0.235525    26.8123     6.55108     0.159749    
thread19::sum      13          341.439     0.837449    163.682     26.2645     0.837438    
thread19::scale    39          1.47895     0.006154    0.199478    0.0379217   0.00362738  
thread19::adam     13          64.8007     0.151909    24.4713     4.98467     0.158935    
thread18::sum      13          456.053     0.796035    161.834     35.081      0.926317    
thread18::scale    39          1.97224     0.006071    0.293454    0.0505702   0.00400593  
thread18::adam     13          34.3038     0.131301    24.725      2.63876     0.0696766   
thread17::sum      13          318.894     0.996461    142.069     24.5303     0.827401    
thread17::scale    39          1.95817     0.006377    0.254958    0.0502094   0.00508066  
thread17::adam     13          64.5644     0.241266    24.6702     4.9665      0.167519    
thread16::sum      13          291.522     0.683259    140.136     22.4248     0.836797    
thread16::scale    39          1.43785     0.006556    0.084183    0.0368679   0.00412725  
thread16::adam     13          55.4187     0.12446     26.1774     4.26298     0.159076    
thread15::sum      13          303.415     0.757413    136.836     23.3396     0.885039    
thread15::scale    39          1.65926     0.006323    0.14057     0.042545    0.00483993  
thread15::adam     13          37.7523     0.103088    24.5471     2.90403     0.110121    
thread14::sum      13          165.77      0.687292    135.521     12.7515     0.599042    
thread14::scale    39          1.42776     0.006194    0.157708    0.0366093   0.0051595   
thread14::adam     13          109.527     0.222062    51.2808     8.42519     0.395799    
thread13::sum      13          23.8163     0.846704    7.99499     1.83203     0.260574    
thread13::scale    39          1.74557     0.008465    0.157696    0.0447583   0.0190983   
thread13::adam     13          65.8377     0.074946    33.6993     5.06444     0.720328    
thread12::sum      13          44.455      0.908589    8.0047      3.41961     0.39983     
thread12::scale    39          1.77746     0.00865     0.260496    0.0455759   0.0159865   
thread12::adam     13          64.9523     0.207097    26.8685     4.99633     0.584184    
thread11::sum      13          162.658     0.595326    135.515     12.5121     0.576125    
thread11::scale    39          1.49146     0.006815    0.098827    0.0382424   0.00528266  
thread11::adam     13          118.181     0.077595    46.4188     9.09087     0.418592    
thread10::sum      13          476.553     0.719884    183.813     36.6579     0.977591    
thread10::scale    39          1.74765     0.006167    0.121551    0.0448116   0.0035851   
thread10::adam     13          9.17602     0.081131    2.86736     0.705848    0.0188235   
thread9::sum       13          599.527     0.792357    155.704     46.1174     0.975547    
thread9::scale     39          1.51719     0.00653     0.129496    0.0389022   0.00246876  
thread9::adam      13          13.5103     0.066592    3.06828     1.03926     0.0219839   
thread8::sum       13          296.805     0.828923    137.591     22.8311     0.823932    
thread8::scale     39          1.87482     0.006359    0.142041    0.0480724   0.00520452  
thread8::adam      13          61.5501     0.12595     27.2032     4.73462     0.170864    
thread7::sum       13          443.712     0.647475    148.035     34.1317     0.978329    
thread7::scale     39          1.98795     0.006271    0.1448      0.050973    0.00438318  
thread7::adam      13          7.84051     0.119144    2.07625     0.603116    0.0172873   
thread6::sum       13          193.585     0.846455    150.551     14.8912     0.74553     
thread6::scale     39          1.43208     0.00709     0.144018    0.03672     0.00551519  
thread6::adam      13          64.6441     0.136264    24.7204     4.97262     0.248955    
thread5::sum       13          200.474     0.884548    169.473     15.4211     0.682018    
thread5::scale     39          1.58189     0.006339    0.139907    0.0405613   0.00538162  
thread5::adam      13          91.8868     0.166478    34.106      7.06822     0.312601    
thread4::sum       13          36.4495     0.819802    9.13764     2.80381     0.250944    
thread4::scale     39          1.40816     0.008029    0.127165    0.0361066   0.00969473  
thread4::adam      13          107.392     0.092883    26.2525     8.26092     0.739361    
thread3::sum       13          348.998     0.564498    179.528     26.846      0.964135    
thread3::scale     39          1.74711     0.006443    0.115286    0.0447976   0.00482651  
thread3::adam      13          11.2355     0.118935    2.87944     0.864267    0.0310388   
thread2::sum       13          167.072     0.824552    138.196     12.8517     0.819188    
thread2::scale     39          1.78716     0.006434    0.171067    0.0458245   0.00876281  
thread2::adam      13          35.0891     0.06608     25.2633     2.69916     0.172049    
thread1::sum       13          434.773     0.930336    139.571     33.4441     0.922999    
thread1::scale     39          1.602       0.006155    0.204693    0.0410768   0.00340095  
thread1::adam      13          34.6689     0.247305    24.6821     2.66684     0.0736002   
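
The per-thread rows above are hard to compare by eye, so here is a minimal sketch of one way to total them per event. It is plain Python, not part of Paddle; the names `ROW`, `aggregate_by_event`, and `rank_by_total` are hypothetical helpers, and the code assumes each row has the seven-column layout shown in these reports (`threadN::event  Calls  Total  Min.  Max.  Ave.  Ratio.`).

```python
import re
from collections import defaultdict

# Assumed row layout, taken from the reports in this issue, e.g.:
#   thread1::sum       13    434.773   0.930336   139.571   33.4441   0.922999
ROW = re.compile(
    r"^(?:thread(?P<thread>\d+)::)?(?P<event>\S+)\s+"
    r"(?P<calls>\d+)\s+(?P<total>\S+)\s+(?P<min>\S+)\s+"
    r"(?P<max>\S+)\s+(?P<ave>\S+)\s+(?P<ratio>\S+)\s*$"
)


def aggregate_by_event(report_text):
    """Sum calls and total time (ms) per event over all threadN:: rows.

    Rows without a thread prefix (the merged summary) and header lines are
    skipped, so the merged numbers are not counted twice.
    """
    totals = defaultdict(lambda: {"calls": 0, "total_ms": 0.0})
    for line in report_text.splitlines():
        match = ROW.match(line.strip())
        if match is None or match.group("thread") is None:
            continue
        try:
            total_ms = float(match.group("total"))
        except ValueError:
            continue  # not a numeric row
        entry = totals[match.group("event")]
        entry["calls"] += int(match.group("calls"))
        entry["total_ms"] += total_ms
    return dict(totals)


def rank_by_total(totals):
    """Return (event, stats) pairs sorted by total time, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1]["total_ms"], reverse=True)
```

Passing the text of train.log or pserver.log to `aggregate_by_event` and printing `rank_by_total(...)` gives one line per operator, which makes it easier to see whether waiting (for example `fetch_barrier` on the trainer) or computation (for example `sum` on the pserver) accounts for most of the time.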

The code is available on request; message me privately on Hi: caowei07

Reference: paddlepaddle/Paddle#15553