Created by: zhangting2020
background
- Related to: benchmark #335. In Eigen, the default tensor index type is `long` (64-bit); this PR switches the expand op to a 32-bit (`int`) index. See the sketch after the test case below.
- test case:
```python
config = ExpandConfig(x_shape=[32, 807, 1],
                      expand_times=[4, 1, 807])
```
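To illustrate the change, here is a minimal sketch using standalone Eigen (an illustration of the index-type idea, not the PR's actual kernel code): `Eigen::Tensor`'s last template parameter selects the index type and defaults to the 64-bit `Eigen::DenseIndex`; passing `int` makes all index arithmetic inside the broadcast (expand) evaluator 32-bit.
```cpp
// Minimal sketch with standalone Eigen, not Paddle's actual expand kernel.
// Eigen::Tensor<Scalar, Rank, Options, IndexType> defaults IndexType to the
// 64-bit Eigen::DenseIndex; passing int switches the index math to 32 bit.
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // 32-bit index type instead of the default Eigen::DenseIndex (long).
  Eigen::Tensor<float, 3, Eigen::RowMajor, int> x(32, 807, 1);
  x.setRandom();

  // Broadcast mirrors expand_times = [4, 1, 807] from the test case above,
  // producing a [128, 807, 807] output.
  Eigen::array<int, 3> bcast = {{4, 1, 807}};
  Eigen::Tensor<float, 3, Eigen::RowMajor, int> y = x.broadcast(bcast);
  return 0;
}
```
The same difference is visible in the `EigenMetaKernel` template signatures in the nvprof dumps below: the index parameter is `long` before this PR and `int` after.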
performance
GPU: CUDA 10, V100, measured with nvprof
before | after | speed up |
---|---|---|
1.2544 ms | 0.7556 ms | 39.8% |
CPU: Intel Xeon Gold 6148, measured with the Paddle profiler
before | after | speed up |
---|---|---|
3765.22 ms | 1136.7 ms | 69.8% |
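In both tables, speed up = (before − after) / before: (1.2544 − 0.7556) / 1.2544 ≈ 39.8% on GPU and (3765.22 − 1136.7) / 3765.22 ≈ 69.8% on CPU.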
details
GPU
- before:
```
W0325 06:53:01.038626 29504 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0325 06:53:01.044080 29504 device_context.cc:243] device: 0, cuDNN Version: 7.5.
feed random data
{
framework: "paddle",
version: "0.0.0",
name: "expand",
device: "GPU",
speed: { repeat: 100, start: 10, end: 90, total: 777.73128, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==29504== Profiling application: python expand.py --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==29504== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 97.98% 6.19575s 100 61.958ms 54.682ms 339.99ms [CUDA memcpy DtoH]
1.98% 125.44ms 100 1.2544ms 1.2529ms 1.2604ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=3)
0.03% 2.0082ms 103 19.497us 2.4960us 29.215us [CUDA memcpy HtoD]
0.00% 7.5190us 4 1.8790us 1.5990us 2.7200us [CUDA memset]
API calls: 47.07% 6.35675s 203 31.314ms 19.971us 343.26ms cudaMemcpy
22.75% 3.07266s 8 384.08ms 1.6010us 3.07231s cudaStreamCreateWithFlags
9.88% 1.33484s 1 1.33484s 1.33484s 1.33484s cudaStreamCreate
9.74% 1.31526s 5 263.05ms 23.420us 1.31509s cudaMemGetInfo
6.07% 819.79ms 406 2.0192ms 5.6880us 66.099ms cuModuleUnload
4.28% 578.66ms 4 144.67ms 750ns 578.66ms cudaFree
0.10% 12.846ms 14 917.58us 5.2290us 12.290ms cudaMalloc
0.04% 5.9010ms 100 59.010us 53.821us 73.255us cudaLaunchKernel
```
- after:
```
W0416 04:00:06.941998 19678 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0416 04:00:07.149387 19678 device_context.cc:245] device: 0, cuDNN Version: 7.5.
{
framework: "paddle",
version: "0.0.0",
name: "expand",
device: "GPU",
speed: { repeat: 100, start: 10, end: 90, total: 711.64674, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==19678== Profiling application: python expand.py --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==19678== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 98.84% 6.58699s 100 65.870ms 61.893ms 306.97ms [CUDA memcpy DtoH]
1.13% 75.555ms 100 755.55us 755.13us 757.56us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3)
0.03% 2.0074ms 103 19.489us 2.2720us 29.664us [CUDA memcpy HtoD]
0.00% 7.9670us 4 1.9910us 1.6000us 2.3990us [CUDA memset]
API calls: 40.03% 6.70104s 203 33.010ms 22.238us 309.42ms cudaMemcpy
37.08% 6.20759s 8 775.95ms 1.8530us 6.20726s cudaStreamCreateWithFlags
9.26% 1.55004s 1 1.55004s 1.55004s 1.55004s cudaStreamCreate
8.71% 1.45876s 4 364.69ms 1.0130us 1.45875s cudaFree
4.81% 806.02ms 406 1.9853ms 5.2450us 68.096ms cuModuleUnload
0.04% 6.1261ms 100 61.260us 45.299us 103.78us cudaLaunchKernel
0.01% 2.3611ms 6 393.52us 379.07us 438.43us cuDeviceTotalMem
0.01% 1.8451ms 564 3.2710us 131ns 128.21us cuDeviceGetAttribute
0.01% 1.6857ms 1 1.6857ms 1.6857ms 1.6857ms cudaHostAlloc
0.01% 1.4976ms 101 14.828us 11.093us 24.552us cudaStreamSynchronize
```
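Comparing the two dumps: the `EigenMetaKernel` signature's index parameter changes from `long` to `int`, and the kernel's average time drops from 1.2544 ms to 755.55 us. The dominant [CUDA memcpy DtoH] entry corresponds to transferring the expanded [128, 807, 807] output back to the host, which this PR does not touch.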
CPU
- before:
```
------------------------- Event Summary -------------------------
Event Calls Total Min. Max. Ave. Ratio.
thread0::expand 100 376524 3755.3 4245.9 3765.24 0.973765
thread0::expand/compute 100 376522 3755.28 4245.85 3765.22 0.973758
thread0::expand/infer_shape 100 0.840627 0.006981 0.018734 0.00840627 2.17403e-06
thread0::expand/prepare_data 100 0.189063 0.0014 0.010331 0.00189063 4.88954e-07
thread0::fetch 100 10143.2 89.0853 324.859 101.432 0.0262323
thread0::feed 100 1.0212 0.00856 0.016463 0.010212 2.64102e-06
```
- after:
```
------------------------- Event Summary -------------------------
Event Calls Total Min. Max. Ave. Ratio.
thread0::expand 100 113673 1130.89 1409.54 1136.73 0.922633
thread0::expand/compute 100 113670 1130.86 1409.49 1136.7 0.922614
thread0::expand/infer_shape 100 0.81159 0.006871 0.012413 0.0081159 6.58732e-06
thread0::expand/prepare_data 100 0.221432 0.00154 0.008571 0.00221432 1.79727e-06
thread0::fetch 100 9530.76 88.7755 359.891 95.3076 0.077357
thread0::feed 100 1.21227 0.010558 0.019541
```