GPU training spends 80% of its time reading data; is there a way to optimize this?
Created by: bluemandora
1) PaddlePaddle version: 1.3.1 2) GPU: K40M, CUDA 9.0, cuDNN 7.1 4) System environment: CentOS 6.3, Python 2.7
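- Training information: 1) single machine, multiple GPUs 2) GPU memory information 3) operator information

I am training a CTR prediction model for recommendation on a single machine with 2 GPUs, and the speed is about the same as training on CPU. Profiler analysis shows that about 80% of the time is spent in the read operation. How can I optimize this part?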
| Event | Calls | Total | CPU Time (Ratio) | GPU Time (Ratio) | Min. | Max. | Ave. | Ratio. |
|---|---|---|---|---|---|---|---|---|
| read | 2004 | 236470 | 236277.730512 (0.999186) | 192.419592 (0.000814) | 0.125919 | 322.577 | 117.999 | 0.805451 |
| lookup_table_grad | 63000 | 14089.9 | 11911.488091 (0.845393) | 2178.392463 (0.154607) | 0.096389 | 5.07249 | 0.223649 | 0.0479921 |
| lookup_table | 63000 | 8058.13 | 6718.019685 (0.833695) | 1340.106067 (0.166305) | 0.075689 | 4.7326 | 0.127907 | 0.0274471 |
| reduce | 15000 | 6758.71 | 5373.589342 (0.795061) | 1385.119694 (0.204939) | 0.106 | 4.09572 | 0.450581 | 0.0230211 |
| broadcast | 15000 | 6365.13 | 4775.667701 (0.750286) | 1589.460016 (0.249714) | 0.117815 | 3.65304 | 0.424342 | 0.0216805 |
| sum | 1000 | 3881.38 | 1100.904596 (0.283638) | 2780.471297 (0.716362) | 3.43192 | 7.16132 | 3.88138 | 0.0132205 |
| BufferedReader:MemoryCopy | 1001 | 2832.07 | 2829.449154 (0.999074) | 2.622825 (0.000926) | 1.92755 | 4.01372 | 2.82924 | 0.00964644 |
| GpuMemcpySync:CPU->GPU | 64064 | 2516.78 | 2367.621026 (0.940736) | 149.154221 (0.059264) | 0.025147 | 0.909499 | 0.0392853 | 0.00857249 |
| scale | 30000 | 2142.5 | 2038.901259 (0.951647) | 103.596234 (0.048353) | 0.040674 | 3.27101 | 0.0714166 | 0.00729765 |
| ScopeBufferedSSAGraphExecutorAfterRun | 501 | 1541.83 | 1539.415928 (0.998432) | 2.417507 (0.001568) | 0.183202 | 7.60096 | 3.07751 | 0.0052517 |
| adam | 15000 | 1433.6 | 1290.690076 (0.900317) | 142.905720 (0.099683) | 0.053356 | 2.74961 | 0.0955731 | 0.00488303 |
| mul_grad | 3000 | 1378.22 | 633.652211 (0.459761) | 744.570104 (0.540239) | 0.230314 | 3.39961 | 0.459407 | 0.00469442 |
| auc | 2000 | 1038.96 | 1038.441179 (0.999501) | 0.518191 (0.000499) | 0.255021 | 4.45519 | 0.51948 | 0.00353884 |
| concat_grad | 1000 | 617.564 | 585.327166 (0.947801) | 32.236382 (0.052199) | 0.406266 | 3.84687 | 0.617564 | 0.00210351 |
| mul | 3000 | 598.076 | 284.760975 (0.476129) | 313.314835 (0.523871) | 0.116232 | 0.92284 | 0.199359 | 0.00203713 |
| elementwise_add_grad | 3000 | 536.845 | 348.344141 (0.648872) | 188.501358 (0.351128) | 0.075263 | 5.16344 | 0.178948 | 0.00182857 |
| TensorCopy:GPU->CPU | 4000 | 428.652 | 428.524241 (0.999702) | 0.127637 (0.000298) | 0.042014 | 0.288626 | 0.107163 | 0.00146005 |
| GpuMemcpySync:GPU->CPU | 4000 | 399.387 | 389.634153 (0.975581) | 9.752436 (0.024419) | 0.038693 | 0.278839 | 0.0998466 | 0.00136037 |
| elementwise_add | 3000 | 322.904 | 295.766248 (0.915958) | 27.137521 (0.084042) | 0.063547 | 0.340601 | 0.107635 | 0.00109986 |
| softmax | 1000 | 240.356 | 221.625307 (0.922069) | 18.731166 (0.077931) | 0.15784 | 0.463071 | 0.240356 | 0.000818688 |
| relu_grad | 2000 | 225.316 | 204.473335 (0.907494) | 20.843134 (0.092506) | 0.061577 | 0.356497 | 0.112658 | 0.00076746 |
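
As a rough cross-check of the profile above, the short Python sketch below recomputes the average time per read call, the overall profiled time implied by the read row, and an upper bound on the speedup if read cost nothing. The three numbers are copied from the read row of the table. The function and key names are illustrative only and are not part of PaddlePaddle, and the bound assumes read time does not overlap with the other operators, which this profile alone does not confirm.

```python
# Values copied from the "read" row of the profiler table above.
READ_CALLS = 2004
READ_TOTAL = 236470.0   # "Total" column, in the profiler's time unit
READ_RATIO = 0.805451   # "Ratio." column: share of all profiled time


def summarize_read(calls, total, ratio):
    """Derive a few figures from one profiler row (illustrative helper)."""
    overall = total / ratio          # total profiled time implied by this row
    non_read = overall - total       # time left for every other event
    return {
        "avg_per_call": total / calls,
        "overall": overall,
        "non_read": non_read,
        # Amdahl-style bound: speedup if read took zero time and
        # nothing else changed (assumes no overlap between events).
        "max_speedup": overall / non_read,
    }


summary = summarize_read(READ_CALLS, READ_TOTAL, READ_RATIO)
for name, value in summary.items():
    print("%s: %.3f" % (name, value))
```

The recomputed average per call should agree with the Ave. column for read, which is a quick way to confirm the row was copied correctly. The speedup bound indicates how much there is to gain from the input pipeline before the operators themselves become the limit.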