PaddlePaddle / Paddle — Issue #19425

Opened August 26, 2019 by saxon_zh (Guest) · 6 of 7 tasks completed

Speed up resnet model's throughput on fp16

Created by: gongweibao

Target:

ResNet50, batch size 128 or 256: 1400 images/s per V100 GPU card, 10000+ images/s per V100 machine.

TODO

  • 1. NHWC. According to https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/?linkId=100000002357715:

"The tensors operated on by Tensor Cores should be in a channel-interleaved data layout in memory (Number-Height-Width-Channel, often called NHWC) in order to get the best performance."

  • batchnorm
  • conv
  • pool

Others (to be confirmed):

  • pool_with_index

  • conv_transpose

  • prior_box

  • shuffle_channel

  • affine_channel

  • MaxOutFunctor
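
The layout change the ops above would need to support is just a permutation of the channel axis. A minimal NumPy illustration of the NCHW↔NHWC conversion (this is not Paddle code, only a sketch of the data movement that kernels like `nchwToNhwcKernel` perform on the GPU):

```python
import numpy as np

# A batch of 2 images, 3 channels, 4x4 spatial dims, stored NCHW.
x_nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float16).reshape(2, 3, 4, 4)

# NCHW -> NHWC: move the channel axis last, so all channel values of
# one pixel become contiguous — the layout Tensor Cores prefer.
x_nhwc = x_nchw.transpose(0, 2, 3, 1)
assert x_nhwc.shape == (2, 4, 4, 3)

# The round trip is the inverse permutation.
x_back = x_nhwc.transpose(0, 3, 1, 2)
assert np.array_equal(x_back, x_nchw)
```

If every op in the list accepts NHWC natively, the transpose kernels (and their 14% of GPU time in the profile below) disappear entirely.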

  • 2. elementwise. According to the nvprof result, the elementwise (add, mul) op kernels should be sped up:

```
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.04%  5.32175s     45200  117.74us  2.2390us  598.30us  void nchwToNhwcKernel<__half, __half, float, bool=1, bool=0>(int, int, int, int, __half const *, __half*, float, float)
                   12.15%  4.60367s      6800  677.01us  62.463us  3.0237ms  void cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>(float, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>, cudnnTensorStruct, __half const *, float, __half const , float, cudnnTensorStruct*, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const *, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>*, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const *, cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const , cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1> const , cudnn::detail::bn_bw_1C11_kernel_new<__half, float, float2, int=512, bool=1, int=1>)
                    7.39%  2.80034s      6400  437.55us  114.91us  1.1186ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16, int=1, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<paddle::platform::float16 const , paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(paddle::platform::float16, int=1)
                    6.35%  2.40750s      9800  245.66us  30.688us  1.1385ms  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16, int=1, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<paddle::platform::float16 const , paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorConversionOp<paddle::platform::float16, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_cmp_op<paddle::platform::float16 const , paddle::platform::float16 const , Eigen::internal::ComparisonName>, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const > const > const > const , Eigen::GpuDevice>, long>(paddle::platform::float16, int=1)
                    5.68%  2.15245s      5000  430.49us  123.65us  1.3419ms  void cudnn::detail::bn_fw_tr_1C11_kernel_NCHW<__half, float, int=512, bool=1, int=1>(cudnnTensorStruct, __half const *, cudnn::detail::bn_fw_tr_1C11_kernel_NCHW<__half, float, int=512, bool=1, int=1>, cudnnTensorStruct*, float const *, float const , cudnnTensorStruct*, cudnnTensorStruct*, cudnnTensorStruct**, float const *, float const *, float const *, cudnnTensorStruct*, cudnnTensorStruct*)
                    5.34%  2.02383s      6200  326.42us  204.51us  646.55us  volta_fp16_s884cudnn_fp16_128x128_ldg8_dgrad_f2f_exp_interior_nhwc2nchw_tt_v1
                    4.65%  1.76357s      6600  267.21us  179.52us  392.35us  volta_s884cudnn_fp16_128x128_ldg8_wgrad_idx_exp_interior_nhwc_nt
                    4.46%  1.69164s      9800  172.62us  19.872us  785.18us  void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16, int=1, int=1, long>, int=0, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_max_op<paddle::platform::float16 const , paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const , Eigen::TensorCwiseNullaryOp<Eigen::internal::scalar_constant_op<paddle::platform::float16 const >, Eigen::TensorMap<Eigen::Tensor<paddle::platform::float16 const , int=1, int=1, long>, int=0, Eigen::MakePointer> const > const > const > const , Eigen::GpuDevice>, long>(paddle::platform::float16, int=1)
                    3.83%  1.45276s      3800  382.30us  258.43us  592.63us  volta_s884cudnn_fp16_64x64_sliced1x4_ldg8_wgrad_idx_exp_interior_nhwc_nt
                    3.47%  1.31410s      3200  410.66us  218.59us  576.73us  volta_fp16_s884cudnn_fp16_256x128_ldg8_relu_filter1x1_stg8_interior_nchw_nn_v1
```
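
Each Eigen elementwise kernel above makes a full pass over global memory, writing an intermediate tensor between launches; fusing the add and mul into one traversal removes that round trip. A NumPy-level sketch of the memory-traffic argument (names are illustrative; the Python loop only mimics what a single fused CUDA kernel would do per element):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(1024).astype(np.float16)
b = rng.random(1024).astype(np.float16)
scale = np.float16(0.5)

# Unfused, as profiled: two kernel launches, each a full memory pass.
tmp = a + b                # pass 1: read a, b; write tmp
out_unfused = tmp * scale  # pass 2: read tmp; write out

# Fused: one traversal applies both ops per element, never
# materializing the intermediate tensor in global memory.
out_fused = np.empty_like(a)
for i in range(a.size):
    out_fused[i] = (a[i] + b[i]) * scale

assert np.allclose(out_unfused, out_fused)
```

For fp16 tensors the unfused path also doubles the number of low-arithmetic-intensity launches, which is why these kernels dominate the non-conv time in the profile.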
  • 3. Move the cuDNN cache to global scope.
  • 4. Add DALI support for data preprocessing. - Paddle iterator.
  • 5. Verify the fp16 optimizer's functionality. - AMP
  • 6. Check exhaustive search on fp16.
  • 7. Assert that all ops use the same layout (such as NHWC) at compile time.
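
The compile-time check in item 7 can be sketched as a pass over the program's op list. This toy version uses plain dicts with a `data_format` attribute — the names are hypothetical, not actual Paddle program-desc structures:

```python
# Toy op list standing in for a compiled program's operators.
ops = [
    {"type": "conv2d", "data_format": "NHWC"},
    {"type": "batch_norm", "data_format": "NHWC"},
    {"type": "pool2d", "data_format": "NHWC"},
]

def assert_uniform_layout(ops, expected="NHWC"):
    """Fail compilation if any layout-sensitive op deviates from `expected`.

    Ops without a data_format attribute (e.g. elementwise ops) are skipped.
    """
    for op in ops:
        fmt = op.get("data_format")
        if fmt is not None and fmt != expected:
            raise ValueError(f"{op['type']} uses {fmt}, expected {expected}")

assert_uniform_layout(ops)  # passes: every layout-sensitive op is NHWC
```

Running this during compilation (rather than at kernel launch) catches a stray NCHW op before it silently inserts the nchwToNhwc transposes seen in the profile.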