Created by: zhangting2020
background
- Related to: benchmark #335. In Eigen, the default tensor index type is `long` (64-bit); this PR switches the expand op to a 32-bit (`int`) index. See the sketch after the test case below.
- test case:
```python
config = ExpandConfig(x_shape=[32, 807, 1],
                      expand_times=[4, 1, 807])
```
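To illustrate the change, here is a minimal sketch using standalone Eigen (an illustration of the index-type idea, not the PR's actual kernel code): `Eigen::Tensor`'s last template parameter selects the index type and defaults to the 64-bit `Eigen::DenseIndex`; passing `int` makes all index arithmetic inside the broadcast (expand) evaluator 32-bit.
```cpp
// Minimal sketch with standalone Eigen, not Paddle's actual expand kernel.
// Eigen::Tensor<Scalar, Rank, Options, IndexType> defaults IndexType to the
// 64-bit Eigen::DenseIndex; passing int switches the index math to 32 bit.
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // 32-bit index type instead of the default Eigen::DenseIndex (long).
  Eigen::Tensor<float, 3, Eigen::RowMajor, int> x(32, 807, 1);
  x.setRandom();

  // Broadcast mirrors expand_times = [4, 1, 807] from the test case above,
  // producing a [128, 807, 807] output.
  Eigen::array<int, 3> bcast = {{4, 1, 807}};
  Eigen::Tensor<float, 3, Eigen::RowMajor, int> y = x.broadcast(bcast);
  return 0;
}
```
The same difference is visible in the `EigenMetaKernel` template signatures in the nvprof dumps below: the index parameter is `long` before this PR and `int` after.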
performance
GPU: CUDA 10, V100, measured with nvprof
before | after | speed up |
---|---|---|
1.2544 ms | 0.7556 ms | 39.8% |
CPU: Intel Xeon Gold 6148, measured with the Paddle profiler
before | after | speed up |
---|---|---|
3765.22 ms | 1136.7 ms | 69.8% |
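In both tables, speed up = (before − after) / before: (1.2544 − 0.7556) / 1.2544 ≈ 39.8% on GPU and (3765.22 − 1136.7) / 3765.22 ≈ 69.8% on CPU.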
details
GPU
- before:
```
W0325 06:53:01.038626 29504 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0325 06:53:01.044080 29504 device_context.cc:243] device: 0, cuDNN Version: 7.5.
feed random data
{
framework: "paddle",
version: "0.0.0",
name: "expand",
device: "GPU",
speed: { repeat: 100, start: 10, end: 90, total: 777.73128, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==29504== Profiling application: python expand.py --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==29504== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 97.98% 6.19575s 100 61.958ms 54.682ms 339.99ms [CUDA memcpy DtoH]
1.98% 125.44ms 100 1.2544ms 1.2529ms 1.2604ms void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=3)
0.03% 2.0082ms 103 19.497us 2.4960us 29.215us [CUDA memcpy HtoD]
0.00% 7.5190us 4 1.8790us 1.5990us 2.7200us [CUDA memset]
API calls: 47.07% 6.35675s 203 31.314ms 19.971us 343.26ms cudaMemcpy
22.75% 3.07266s 8 384.08ms 1.6010us 3.07231s cudaStreamCreateWithFlags
9.88% 1.33484s 1 1.33484s 1.33484s 1.33484s cudaStreamCreate
9.74% 1.31526s 5 263.05ms 23.420us 1.31509s cudaMemGetInfo
6.07% 819.79ms 406 2.0192ms 5.6880us 66.099ms cuModuleUnload
4.28% 578.66ms 4 144.67ms 750ns 578.66ms cudaFree
0.10% 12.846ms 14 917.58us 5.2290us 12.290ms cudaMalloc
0.04% 5.9010ms 100 59.010us 53.821us 73.255us cudaLaunchKernel
```
- after:
```
W0416 04:00:06.941998 19678 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0416 04:00:07.149387 19678 device_context.cc:245] device: 0, cuDNN Version: 7.5.
{
framework: "paddle",
version: "0.0.0",
name: "expand",
device: "GPU",
speed: { repeat: 100, start: 10, end: 90, total: 711.64674, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}
==19678== Profiling application: python expand.py --dtype float32 --run_with_executor True --check_output False --profiler none --backward False --use_gpu True --repeat 100 --log_level 0
==19678== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 98.84% 6.58699s 100 65.870ms 61.893ms 306.97ms [CUDA memcpy DtoH]
1.13% 75.555ms 100 755.55us 755.13us 757.56us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorBroadcastingOp<Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3)
0.03% 2.0074ms 103 19.489us 2.2720us 29.664us [CUDA memcpy HtoD]
0.00% 7.9670us 4 1.9910us 1.6000us 2.3990us [CUDA memset]
API calls: 40.03% 6.70104s 203 33.010ms 22.238us 309.42ms cudaMemcpy
37.08% 6.20759s 8 775.95ms 1.8530us 6.20726s cudaStreamCreateWithFlags
9.26% 1.55004s 1 1.55004s 1.55004s 1.55004s cudaStreamCreate
8.71% 1.45876s 4 364.69ms 1.0130us 1.45875s cudaFree
4.81% 806.02ms 406 1.9853ms 5.2450us 68.096ms cuModuleUnload
0.04% 6.1261ms 100 61.260us 45.299us 103.78us cudaLaunchKernel
0.01% 2.3611ms 6 393.52us 379.07us 438.43us cuDeviceTotalMem
0.01% 1.8451ms 564 3.2710us 131ns 128.21us cuDeviceGetAttribute
0.01% 1.6857ms 1 1.6857ms 1.6857ms 1.6857ms cudaHostAlloc
0.01% 1.4976ms 101 14.828us 11.093us 24.552us cudaStreamSynchronize
```
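Comparing the two dumps: the `EigenMetaKernel` signature's index parameter changes from `long` to `int`, and the kernel's average time drops from 1.2544 ms to 755.55 us. The dominant [CUDA memcpy DtoH] entry corresponds to transferring the expanded [128, 807, 807] output back to the host, which this PR does not touch.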
CPU
- before:
```
------------------------- Event Summary -------------------------
Event Calls Total Min. Max. Ave. Ratio.
thread0::expand 100 376524 3755.3 4245.9 3765.24 0.973765
thread0::expand/compute 100 376522 3755.28 4245.85 3765.22 0.973758
thread0::expand/infer_shape 100 0.840627 0.006981 0.018734 0.00840627 2.17403e-06
thread0::expand/prepare_data 100 0.189063 0.0014 0.010331 0.00189063 4.88954e-07
thread0::fetch 100 10143.2 89.0853 324.859 101.432 0.0262323
thread0::feed 100 1.0212 0.00856 0.016463 0.010212 2.64102e-06
```
- after:
```
------------------------- Event Summary -------------------------
Event Calls Total Min. Max. Ave. Ratio.
thread0::expand 100 113673 1130.89 1409.54 1136.73 0.922633
thread0::expand/compute 100 113670 1130.86 1409.49 1136.7 0.922614
thread0::expand/infer_shape 100 0.81159 0.006871 0.012413 0.0081159 6.58732e-06
thread0::expand/prepare_data 100 0.221432 0.00154 0.008571 0.00221432 1.79727e-06
thread0::fetch 100 9530.76 88.7755 359.891 95.3076 0.077357
thread0::feed 100 1.21227 0.010558 0.019541
```