`elementwise_add_grad` should be optimized
Created by: wanghaoshuang
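`elementwise_add_grad` should be optimized to avoid the overhead of Eigen. In the profiling report below it is the most expensive op: 7243.75 ms in total over 176 calls (about 41.16 ms per call on average), while the forward `elementwise_add` averages only about 0.155 ms per call.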
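For context, the arithmetic the gradient has to perform is small. The following is a minimal sketch in plain Python, not Paddle's implementation: the function signature, the `y_is_broadcast` flag, and the restriction to a 2-D `dout` with `y` broadcast across rows are assumptions made for illustration only.

```python
def elementwise_add_grad(dout, y_is_broadcast):
    """Gradient of out = x + y for a 2-D dout given as a list of rows.

    Hypothetical signature for illustration; x is assumed to have the
    same shape as dout (N x C). y is either N x C or a length-C vector
    that was broadcast across the N rows.
    """
    # d(x + y)/dx = 1, so dx is just a copy of dout.
    dx = [row[:] for row in dout]
    if not y_is_broadcast:
        # Same-shape case: dy is also a plain copy of dout.
        dy = [row[:] for row in dout]
    else:
        # Broadcast case: each element of y contributed to every row,
        # so its gradient is the column-wise sum of dout.
        dy = [sum(col) for col in zip(*dout)]
    return dx, dy


dout = [[1.0, 2.0], [3.0, 4.0]]
dx_same, dy_same = elementwise_add_grad(dout, y_is_broadcast=False)
dx_bcast, dy_bcast = elementwise_add_grad(dout, y_is_broadcast=True)
```

In the same-shape case both gradients are copies of `dout`, and in the broadcast case only a reduction is needed, so a dedicated kernel for these cases may be cheaper than a general expression-template path. Whether that accounts for the time seen in the report is not established here.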
-------------------------> Profiling Report <-------------------------
Place: CUDA
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total Min. Max. Ave.
thread0::elementwise_add_grad 176 7243.75 0.04736 168.678 41.1577
thread0::warpctc 16 7202.06 1.28614 7180.67 450.129
thread0::conv2d_grad 128 752.047 2.50992 17.4375 5.87536
thread0::conv2d 256 491.638 1.08138 4.74362 1.92046
thread0::batch_norm 256 301.695 0.076608 5.48499 1.17849
thread0::gru 64 299.851 4.54563 5.25917 4.68517
thread0::batch_norm_grad 128 287.493 0.343488 6.05075 2.24604
thread0::gru_grad 32 192.225 5.83638 6.24419 6.00702
thread0::elementwise_add 576 89.5377 0.009984 0.579264 0.155447
thread0::relu 256 57.8738 0.062848 0.482432 0.22607
thread0::mul 128 52.3439 0.191296 0.6896 0.408937
thread0::relu_grad 128 40.9168 0.086784 0.685856 0.319662
thread0::pool2d_grad 64 30.8992 0.135456 0.989888 0.4828
thread0::pool2d 128 24.5201 0.057664 0.394848 0.191563
thread0::mul_grad 64 21.6162 0.043584 0.628768 0.337753
thread0::momentum 688 17.9174 0.008928 0.317664 0.0260427
thread0::im2sequence 32 16.2841 0.504608 0.515488 0.508877
thread0::ctc_align 32 16.0577 0.469184 0.665344 0.501804
thread0::warpctc_grad 16 15.68 0.943936 1.09715 0.98
thread0::top_k 32 9.99485 0.218304 1.05782 0.312339
thread0::im2sequence_grad 16 8.83674 0.54448 0.565408 0.552296
thread0::edit_distance 32 8.6968 0.244896 0.601024 0.271775
thread0::sum 112 8.36666 0.019328 0.235168 0.0747023
thread0::scale 256 7.41229 0.00928 0.1416 0.0289542
thread0::clip 224 6.79677 0.009984 0.141408 0.0303427
thread0::cast 36 6.31331 0.048736 0.229376 0.17537
thread0::feed 64 4.81792 0.048608 0.194784 0.07528
thread0::fill_zeros_like 512 4.63562 0.00736 0.010624 0.00905394
thread0::fetch 36 1.31088 0.024704 0.075424 0.0364133
thread0::reduce_sum 32 1.00794 0.026752 0.078976 0.031498
thread0::fill_constant 24 0.594816 0.017376 0.054784 0.024784
thread0::mean 16 0.541664 0.027392 0.0888 0.033854
thread0::elementwise_div 4 0.285696 0.042336 0.099936 0.071424
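To make the gap between the backward and forward op concrete, the rows of the report can be parsed and compared. This is a small sketch that assumes each row has the six whitespace-separated columns shown above (event, calls, total, min, max, average); `parse_rows` is a hypothetical helper, not part of any profiler API, and only three rows from the report are copied in.

```python
# Three rows copied verbatim from the profiling report (times in ms).
REPORT_ROWS = """\
thread0::elementwise_add_grad 176 7243.75 0.04736 168.678 41.1577
thread0::warpctc 16 7202.06 1.28614 7180.67 450.129
thread0::elementwise_add 576 89.5377 0.009984 0.579264 0.155447
"""


def parse_rows(text):
    """Parse 'Event Calls Total Min Max Ave' rows into a dict keyed by op name."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip headers and anything that is not a 'threadN::op ...' row.
        if len(parts) != 6 or "::" not in parts[0]:
            continue
        name = parts[0].split("::", 1)[1]
        total, min_ms, max_ms, ave = (float(p) for p in parts[2:])
        rows[name] = {
            "calls": int(parts[1]),
            "total": total,
            "min": min_ms,
            "max": max_ms,
            "ave": ave,
        }
    return rows


rows = parse_rows(REPORT_ROWS)
grad = rows["elementwise_add_grad"]
fwd = rows["elementwise_add"]

# How much slower an average backward call is than an average forward call.
ratio = grad["ave"] / fwd["ave"]
print(f"elementwise_add_grad is about {ratio:.0f}x slower per call than elementwise_add")
```

The large spread between the minimum (0.04736 ms) and maximum (168.678 ms) of `elementwise_add_grad` suggests that only some calls are slow, so comparing per-call timings by input shape would be a reasonable next step.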