The profiling results for ResNet.
Created by: qingqing01
Config and Env
- ResNet 50: https://github.com/PaddlePaddle/Paddle/pull/6056
- Input: 3 * 224 * 224
- batch_size = 32, 80 iterations
- Env: nvidia-docker, TITAN X (Pascal), single machine.
The profiling results
- By code: https://github.com/PaddlePaddle/Paddle/compare/develop...qingqing01:add_timer?expand=1
- Compile mode: Release
- CMake config:
-DWITH_GPU=ON -DWITH_TIMER=ON -DWITH_MKL=ON -DCMAKE_BUILD_TYPE=Release
Total examples: 2560, total time: 47.83653 sec
53.51559 examples/sec, 1.49489 sec/batch
unit: ms
Stat=ExecutorRunTimer total=37099.9 avg=227.606 max=526.269 min=0.092 count=163
Stat=conv2d_grad total=18104.7 avg=4.269 max=15.852 min=2.222 count=4240
Stat=conv2d total=9482.79 avg=2.236 max=10.438 min=0.601 count=4240
~~Stat=Im2ColTimer total=461.405 avg=0.007 max=2.732 min=0.004 count=58880~~
~~Stat=GemmTimer total=2343.45 avg=0.017 max=5.965 min=0.007 count=135680~~
Stat=batch_norm_grad total=2008.38 avg=0.473 max=5.74 min=0.086 count=4240
Stat=batch_norm total=1679.04 avg=0.396 max=4.169 min=0.09 count=4240
Stat=sum total=1347.13 avg=0.935 max=3.724 min=0.013 count=1440
Stat=relu_grad total=988.555 avg=0.252 max=5.071 min=0.044 count=3920
Stat=elementwise_add_grad total=706.461 avg=0.519 max=6.632 min=0.044 count=1360
Stat=relu total=694.561 avg=0.177 max=2.699 min=0.031 count=3920
Stat=elementwise_add total=560.761 avg=0.412 max=3.953 min=0.02 count=1360
Stat=momentum total=516.091 avg=0.04 max=9.432 min=0.02 count=12880
Stat=pool2d_grad total=326.743 avg=2.042 max=6.607 min=0.241 count=160
Stat=CreateOpTimer total=237.364 avg=0.005 max=7.503 min=0.001 count=44433
Stat=DeleteLocalScopeTimer total=108.412 avg=0.665 max=2.648 min=0.001 count=163
Stat=CreateLocalScopeTimer total=76.612 avg=0.47 max=1.804 min=0.004 count=163
Stat=pool2d total=73.987 avg=0.462 max=0.675 min=0.315 count=160
Stat=fill_constant total=19.168 avg=0.03 max=12.984 min=0.006 count=619
Stat=cast total=12.543 avg=0.039 max=0.219 min=0.016 count=320
Stat=gaussian_random total=11.399 avg=0.215 max=1.26 min=0.015 count=53
Stat=mul total=9.389 avg=0.117 max=0.212 min=0.092 count=80
Stat=accuracy total=8.361 avg=0.104 max=0.477 min=0.068 count=80
Stat=mul_grad total=7.426 avg=0.092 max=0.179 min=0.075 count=80
Stat=feed total=6.262 avg=0.039 max=0.185 min=0.015 count=160
Stat=softmax total=5.667 avg=0.07 max=0.677 min=0.042 count=80
Stat=fetch total=5.084 avg=0.021 max=0.254 min=0.01 count=240
Stat=top_k total=3.373 avg=0.042 max=0.126 min=0.026 count=80
Stat=softmax_grad total=3.074 avg=0.038 max=0.1 min=0.029 count=80
Stat=cross_entropy_grad total=2.604 avg=0.032 max=0.091 min=0.023 count=80
Stat=cross_entropy total=2.429 avg=0.03 max=0.075 min=0.021 count=80
Stat=elementwise_div total=2.412 avg=0.03 max=0.096 min=0.02 count=80
Stat=mean total=1.997 avg=0.024 max=0.077 min=0.018 count=80
Stat=mean_grad total=1.838 avg=0.022 max=0.058 min=0.016 count=80
Stat=uniform_random total=0.044 avg=0.044 max=0.044 min=0.044 count=1
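The timer lines above share a fixed `Stat=... total=... avg=... max=... min=... count=...` layout, so they can be ranked by their share of `ExecutorRunTimer` with a few lines of Python. The following is a minimal sketch, not part of the profiling branch: the helper names `parse_stats` and `share_of` are hypothetical, and the sample only copies a handful of lines from the log.

```python
import re

# Matches one timer line, e.g.
# "Stat=conv2d total=9482.79 avg=2.236 max=10.438 min=0.601 count=4240".
STAT_RE = re.compile(
    r"Stat=(?P<name>\S+)\s+total=(?P<total>[\d.]+)\s+avg=(?P<avg>[\d.]+)"
    r"\s+max=(?P<max>[\d.]+)\s+min=(?P<min>[\d.]+)\s+count=(?P<count>\d+)"
)


def parse_stats(log_text):
    """Hypothetical helper: map each timer name to its numeric fields (ms)."""
    stats = {}
    for line in log_text.splitlines():
        match = STAT_RE.search(line)
        if match:
            stats[match["name"]] = {
                "total": float(match["total"]),
                "avg": float(match["avg"]),
                "max": float(match["max"]),
                "min": float(match["min"]),
                "count": int(match["count"]),
            }
    return stats


def share_of(stats, name, reference="ExecutorRunTimer"):
    """Fraction of the reference timer's total spent in the named timer."""
    return stats[name]["total"] / stats[reference]["total"]


# A few lines copied from the log above.
SAMPLE_LOG = """\
Stat=ExecutorRunTimer total=37099.9 avg=227.606 max=526.269 min=0.092 count=163
Stat=conv2d_grad total=18104.7 avg=4.269 max=15.852 min=2.222 count=4240
Stat=conv2d total=9482.79 avg=2.236 max=10.438 min=0.601 count=4240
Stat=batch_norm_grad total=2008.38 avg=0.473 max=5.74 min=0.086 count=4240
Stat=batch_norm total=1679.04 avg=0.396 max=4.169 min=0.09 count=4240
"""

stats = parse_stats(SAMPLE_LOG)
# Print operators from most to least expensive, with their share of executor time.
for name in sorted(stats, key=lambda n: stats[n]["total"], reverse=True):
    if name != "ExecutorRunTimer":
        print(f"{name:<20} {stats[name]['total']:>10.2f} ms  {share_of(stats, name):6.1%}")
```

On these numbers, conv2d_grad and conv2d together account for roughly three quarters of the time measured by `ExecutorRunTimer`.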
--------------------------------------------------
Operators that need optimization
- Conv2d/Conv2d_grad
  - The total time of conv2d is 9482.79 ms, but the main computation, im2col and gemm, takes only 461.405 + 2343.45 = 2804.855 ms. (There is no stream synchronization between im2col and gemm, so the time for im2col and gemm is not accurate.)
- relu/relu_grad
- elementwise_add/elementwise_add_grad
- momentum
- sum
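The caveat about missing stream synchronization can be illustrated with a CPU-only analogy. The sketch below is not the timer code from the profiling branch: a worker thread stands in for a GPU stream, `fake_kernel` and `time_launch` are hypothetical names, and `time.sleep` replaces real work. It only shows why a host-side timer that stops before waiting for asynchronous work under-reports the cost.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(duration_sec):
    """Stand-in for asynchronously executed device work (assumption: sleep models it)."""
    time.sleep(duration_sec)


def time_launch(pool, duration_sec, synchronize):
    """Time one 'launch' from the host side, optionally waiting for completion."""
    start = time.perf_counter()
    future = pool.submit(fake_kernel, duration_sec)  # returns immediately, like a launch
    if synchronize:
        future.result()  # analogous to synchronizing the stream before stopping the timer
    elapsed = time.perf_counter() - start
    future.result()  # drain the work so the next measurement starts clean
    return elapsed


with ThreadPoolExecutor(max_workers=1) as pool:
    unsynced = time_launch(pool, 0.05, synchronize=False)
    synced = time_launch(pool, 0.05, synchronize=True)

print(f"without synchronization: {unsynced * 1000:.3f} ms")
print(f"with synchronization:    {synced * 1000:.3f} ms")
```

The unsynchronized measurement captures only the submission cost, which is consistent with the im2col and gemm timers reporting far less than the enclosing conv2d total.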
Time spent in Python accounts for about 22% of the total time.
(47.83653-37.0999)/47.83653 = 22.44 %
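The estimate above can be reproduced with a short calculation. This sketch assumes, as the estimate does, that all wall-clock time not covered by `ExecutorRunTimer` is spent on the Python side; the function name `python_overhead_fraction` is hypothetical.

```python
def python_overhead_fraction(total_wall_sec, executor_total_ms):
    """Fraction of wall-clock time not covered by the executor timer.

    Assumption: everything outside ExecutorRunTimer is attributed to Python.
    """
    executor_sec = executor_total_ms / 1000.0  # timer totals are reported in ms
    return (total_wall_sec - executor_sec) / total_wall_sec


# Values taken from the log above.
TOTAL_WALL_SEC = 47.83653
EXECUTOR_TOTAL_MS = 37099.9

overhead = python_overhead_fraction(TOTAL_WALL_SEC, EXECUTOR_TOTAL_MS)
print(f"Python share of total time: {overhead:.2%}")
```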