PaddlePaddle / Paddle Issue #25872
Opened July 31, 2020 by saxon_zh (Guest)

Verification of dygraph MKLDNN accuracy convergence

Created by: sfraczek

Quick Note: OneDNN was previously named DNNL and MKLDNN.

Instructions how to run Dygraph OneDNN training

Once you merge the required pull requests, you can run training of a few dygraph models with OneDNN kernels. Some modifications to the models are still required. The models whose training we are starting to support now are Mnist, ResNet, MobileNetV1, and MobileNetV2.

You can prepend DNNL_VERBOSE=1 to the command to see which primitives the OneDNN library creates, and thereby verify which ops are using OneDNN primitives.
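As an illustrative sketch of working with that output, you could collect the primitive kinds seen in the log. The sample lines below are made up to mimic the comma-separated dnnl_verbose format; the exact fields in real logs depend on the OneDNN version:

```python
# Sketch: collect primitive kinds from DNNL_VERBOSE output.
# The sample log is fabricated, not real training output.
sample_log = """dnnl_verbose,exec,cpu,convolution,jit:avx2,forward_training,src_f32::blocked:abcd:f0,,,1x3x224x224,1.25
dnnl_verbose,exec,cpu,eltwise,jit:avx2,forward_training,data_f32::blocked:abcd:f0,,alg:eltwise_bounded_relu,1x32x112x112,0.31
some unrelated training output line"""

def primitive_kinds(log_text):
    """Return the set of primitive kinds (4th field) seen in dnnl_verbose lines."""
    kinds = set()
    for line in log_text.splitlines():
        fields = line.split(",")
        if fields[0] == "dnnl_verbose" and len(fields) > 3:
            kinds.add(fields[3])
    return kinds

print(sorted(primitive_kinds(sample_log)))  # ['convolution', 'eltwise']
```

Grepping the real log for the op kinds you expect (convolution, eltwise, pooling, ...) is usually enough to confirm that the OneDNN kernels are actually being hit.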

All models info

Some training scripts always use the GPU (e.g. ResNet), so you have to add a switch such as --use_gpu and disable it to train on CPU. Then add the switch to the command.

Mnist

FLAGS_use_mkldnn=true python train.py

Mobilenet

FLAGS_use_mkldnn=true python train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=/data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2

(Pass --model=MobileNetV1 for V1.) You also have to add use_mkldnn=True to ops which are not imported from dygraph:

diff --git a/dygraph/mobilenet/mobilenet_v2.py b/dygraph/mobilenet/mobilenet_v2.py
index 6da031f..8541c87 100644
--- a/dygraph/mobilenet/mobilenet_v2.py
+++ b/dygraph/mobilenet/mobilenet_v2.py
@@ -66,7 +66,7 @@ class ConvBNLayer(fluid.dygraph.Layer):
         y = self._conv(inputs)
         y = self._batch_norm(y)
         if if_act:
-            y = fluid.layers.relu6(y)
+            y = fluid.layers.relu6(y, use_mkldnn=True)
         return y


@@ -112,7 +112,7 @@ class InvertedResidualUnit(fluid.dygraph.Layer):
         y = self._bottleneck_conv(y, if_act=True)
         y = self._linear_conv(y, if_act=False)
         if ifshortcut:
-            y = fluid.layers.elementwise_add(inputs, y)
+            y = fluid.layers.elementwise_add(inputs, y, use_mkldnn=True)
         return y

ResNet

FLAGS_use_mkldnn=true python train.py

You also have to add use_mkldnn=True to ops which are not imported from dygraph:

diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py
index 6bf86f9..f53c5a2 100644
--- a/dygraph/resnet/train.py
+++ b/dygraph/resnet/train.py
@@ -239,9 +239,9 @@ class BottleneckBlock(fluid.dygraph.Layer):
         else:
             short = self.short(inputs)

-        y = fluid.layers.elementwise_add(x=short, y=conv2)
+        y = fluid.layers.elementwise_add(x=short, y=conv2, use_mkldnn=True)

-        layer_helper = LayerHelper(self.full_name(), act='relu')
+        layer_helper = LayerHelper(self.full_name(), act='relu', use_mkldnn=True)
         return layer_helper.append_activation(y)
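The pattern in both diffs is the same: ops invoked as free functions do not pick up the global FLAGS_use_mkldnn flag, so the attribute must be passed per call. A toy stand-in illustrating that kwarg plumbing (this elementwise_add is a hypothetical mock, not the real fluid.layers API):

```python
# Toy stand-in for the pattern above: the op only selects the OneDNN kernel
# when the caller passes use_mkldnn=True explicitly.
def elementwise_add(x, y, use_mkldnn=False):
    kernel = "onednn" if use_mkldnn else "plain"
    return [a + b for a, b in zip(x, y)], kernel

out, kernel = elementwise_add([1.0, 2.0], [3.0, 4.0], use_mkldnn=True)
print(out, kernel)  # [4.0, 6.0] onednn
```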

Which PRs have to be merged to run Dygraph OneDNN training

Required

  • add use_mkldnn attribute to ops in dygraph https://github.com/PaddlePaddle/Paddle/pull/25773
  • don't clear mkldnn cache in block_op executor dtor https://github.com/PaddlePaddle/Paddle/pull/25735
  • Enable mkldnn layout conversion https://github.com/PaddlePaddle/Paddle/pull/25778
  • Added Relu6 mkldnn op https://github.com/PaddlePaddle/Paddle/pull/25713

Related PRs

  • support mnist and resnet dygraph_to_static test https://github.com/PaddlePaddle/Paddle/pull/25774
  • enable check_dygraph for mkldnn activation tests https://github.com/PaddlePaddle/Paddle/pull/25779

Request for verifying training accuracy convergence

Could you please verify proper accuracy convergence for us? With limited resources it is hard for us to do, and we don't have a procedure for that in place. For example, MobileNet might take many days to train.

We have only been able to run the full test with the --ce flag for Mnist and ResNet (Flowers).

Mnist training

Name       Result
Reference  At epoch 4, test avg_loss: 0.0372143830631, acc: 0.989182692308
OneDNN     At epoch 4, test avg_loss: 0.0370462594453, acc: 0.98828125
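For reference, a minimal sketch of how such a comparison could be checked automatically (the 1% relative tolerance is an arbitrary choice for illustration, not a project standard):

```python
# Sketch: compare OneDNN metrics against the reference run within a relative
# tolerance. Numbers are the Mnist results from the table above.
reference = {"avg_loss": 0.0372143830631, "acc": 0.989182692308}
onednn = {"avg_loss": 0.0370462594453, "acc": 0.98828125}

def converged(ref, got, rel_tol=0.01):
    """True if every metric matches the reference within rel_tol (relative)."""
    return all(abs(got[k] - ref[k]) <= rel_tol * abs(ref[k]) for k in ref)

print(converged(reference, onednn))  # True
```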

Resnet flowers training

Name       Result
Reference  Final eval: acc1 0.689, acc5 0.886
OneDNN     Final eval: acc1 0.746, acc5 0.927

Final note

Since we are just starting to support OneDNN training in PaddlePaddle, there may still be some bugs with training that may impact accuracy.
