Created by: qingqing01
Fix https://github.com/PaddlePaddle/Paddle/issues/7577
- Add a profiler to the executor that records the time spent in each op.
- Expose the profiler to Python.
- The unit test prints the following report:
-------------------------> Profiling Report <-------------------------
Place: CPU
Time unit: ms
Sorted by total time in descending order in the same thread
Event                          Calls  Total      Min.       Max.       Ave.
thread0::mul_grad              24     3.9817     0.030622   2.19622    0.165904
thread0::mul                   24     2.43597    0.018009   0.257821   0.101499
thread0::momentum              48     1.65413    0.012138   0.128806   0.0344611
thread0::softmax               8      1.04434    0.129704   0.131114   0.130543
thread0::elementwise_add_grad  24     0.427291   0.01508    0.023408   0.0178038
thread0::elementwise_add       24     0.35214    0.009682   0.020287   0.0146725
thread0::cast                  32     0.286981   0.006007   0.0149     0.00896816
thread0::relu_grad             16     0.268582   0.013828   0.02076    0.0167864
thread0::top_k                 8      0.251921   0.02915    0.036      0.0314901
thread0::softmax_grad          8      0.196288   0.024375   0.024798   0.024536
thread0::sum                   16     0.172279   0.009126   0.012375   0.0107674
thread0::feed                  16     0.1705     0.00402    0.017667   0.0106562
thread0::relu                  16     0.132929   0.007018   0.009637   0.00830806
thread0::accuracy              8      0.116111   0.01412    0.015031   0.0145139
thread0::cross_entropy_grad    8      0.106584   0.012811   0.014423   0.013323
thread0::cross_entropy         8      0.096511   0.011705   0.012282   0.0120639
thread0::elementwise_div       8      0.089527   0.010935   0.01219    0.0111909
thread0::mean_grad             8      0.079708   0.008923   0.015726   0.0099635
thread0::fetch                 24     0.067049   0.001737   0.003392   0.00279371
thread0::mean                  8      0.05091    0.006111   0.006513   0.00636375
thread0::fill_constant         8      0.038575   0.004709   0.004976   0.00482187
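
For readers unfamiliar with the report columns, here is a minimal pure-Python sketch of how per-op statistics like these could be aggregated. It is not the code in this PR: `OpProfiler`, `record`, and `summary` are hypothetical names chosen for illustration, and the real profiler lives in the C++ executor. The sketch only assumes what the report itself shows, namely that each event is timed in milliseconds, that `Ave.` equals `Total / Calls` (for example 3.9817 / 24 ≈ 0.165904 for `mul_grad`), and that rows are sorted by total time in descending order.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OpProfiler:
    """Hypothetical per-op timer used only to illustrate the report columns."""

    def __init__(self, clock=time.perf_counter):
        # `clock` returns seconds; it is injectable so the class can be tested
        # with a deterministic fake clock.
        self._clock = clock
        self._durations_ms = defaultdict(list)

    @contextmanager
    def record(self, name):
        """Time the enclosed block and store its duration under `name`."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self._durations_ms[name].append(elapsed_ms)

    def summary(self):
        """Return one row per event, sorted by total time in descending order."""
        rows = []
        for name, times in self._durations_ms.items():
            total = sum(times)
            rows.append({
                "event": name,
                "calls": len(times),
                "total": total,
                "min": min(times),
                "max": max(times),
                "ave": total / len(times),
            })
        rows.sort(key=lambda row: row["total"], reverse=True)
        return rows


if __name__ == "__main__":
    profiler = OpProfiler()
    for _ in range(3):
        with profiler.record("mul"):
            sum(i * i for i in range(10000))  # stand-in for real op work
        with profiler.record("relu"):
            max(0, -1)
    print("Event Calls Total Min. Max. Ave.")
    for row in profiler.summary():
        print(row["event"], row["calls"], row["total"],
              row["min"], row["max"], row["ave"])
```

Keeping the raw durations per event makes `Min.`, `Max.`, and `Ave.` trivial to derive; a production profiler would more likely keep running aggregates to avoid storing every sample.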