Verification of dygraph MKLDNN accuracy convergence
Created by: sfraczek
Quick Note: OneDNN was previously named DNNL and MKLDNN.
Instructions for running Dygraph OneDNN training
Once you merge the required pull requests, you can run training of a few dygraph models with OneDNN kernels. Some modifications to the models are still required. The models whose training we are starting to support now are Mnist, ResNet, MobileNetV1 and MobileNetV2.
You can prepend DNNL_VERBOSE=1 to the command to see which primitives are created by the OneDNN library and so verify which ops are using OneDNN primitives.
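For example, combined with the Mnist command given below, this looks as follows:
DNNL_VERBOSE=1 FLAGS_use_mkldnn=true python train.py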
All models info
Some training scripts, e.g. ResNet, always use the GPU, so you have to add a switch such as --use_gpu to the script and then disable it on the command line to run on the CPU.
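For scripts that lack such a switch, a minimal sketch of adding one could look like the code below. This is only an illustration under the assumption that the script uses argparse and fluid dygraph; the str2bool helper and the defaults are not taken from the actual model scripts.

# Hypothetical sketch: add a --use_gpu switch so the CPU can be selected
# from the command line. Names and defaults are illustrative only.
import argparse
import paddle.fluid as fluid

def str2bool(v):
    # argparse would treat the string "False" as truthy, so convert explicitly
    return v.lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--use_gpu", type=str2bool, default=True,
                    help="Train on the GPU if True, otherwise on the CPU")
args = parser.parse_args()

# Pick the execution place from the switch and run dygraph under it
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
with fluid.dygraph.guard(place):
    pass  # model construction and the training loop go here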
Mnist
FLAGS_use_mkldnn=true python train.py
Mobilenet
FLAGS_use_mkldnn=true python train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=/data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2
(use --model=MobileNetV1 for V1)
You also have to add use_mkldnn=True to ops that are not imported from dygraph:
diff --git a/dygraph/mobilenet/mobilenet_v2.py b/dygraph/mobilenet/mobilenet_v2.py
index 6da031f..8541c87 100644
--- a/dygraph/mobilenet/mobilenet_v2.py
+++ b/dygraph/mobilenet/mobilenet_v2.py
@@ -66,7 +66,7 @@ class ConvBNLayer(fluid.dygraph.Layer):
y = self._conv(inputs)
y = self._batch_norm(y)
if if_act:
- y = fluid.layers.relu6(y)
+ y = fluid.layers.relu6(y, use_mkldnn=True)
return y
@@ -112,7 +112,7 @@ class InvertedResidualUnit(fluid.dygraph.Layer):
y = self._bottleneck_conv(y, if_act=True)
y = self._linear_conv(y, if_act=False)
if ifshortcut:
- y = fluid.layers.elementwise_add(inputs, y)
+ y = fluid.layers.elementwise_add(inputs, y, use_mkldnn=True)
return y
ResNet
FLAGS_use_mkldnn=true python train.py
You also have to add use_mkldnn=True to ops that are not imported from dygraph:
diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py
index 6bf86f9..f53c5a2 100644
--- a/dygraph/resnet/train.py
+++ b/dygraph/resnet/train.py
@@ -239,9 +239,9 @@ class BottleneckBlock(fluid.dygraph.Layer):
else:
short = self.short(inputs)
- y = fluid.layers.elementwise_add(x=short, y=conv2)
+ y = fluid.layers.elementwise_add(x=short, y=conv2, use_mkldnn=True)
- layer_helper = LayerHelper(self.full_name(), act='relu')
+ layer_helper = LayerHelper(self.full_name(), act='relu', use_mkldnn=True)
return layer_helper.append_activation(y)
Which PRs have to be merged to run Dygraph OneDNN training
Required
- add use_mkldnn attribute to ops in dygraph https://github.com/PaddlePaddle/Paddle/pull/25773
- don't clear mkldnn cache in block_op executor dtor https://github.com/PaddlePaddle/Paddle/pull/25735
- Enable mkldnn layout conversion https://github.com/PaddlePaddle/Paddle/pull/25778
- Added Relu6 mkldnn op https://github.com/PaddlePaddle/Paddle/pull/25713
Related PRs
- support mnist and resnet dygraph_to_static test https://github.com/PaddlePaddle/Paddle/pull/25774
- enable check_dygraph for mkldnn activation tests https://github.com/PaddlePaddle/Paddle/pull/25779
Request for verifying training accuracy convergence
Could you please verify that training accuracy converges properly? With our limited resources this is hard for us to do, and we do not have a procedure for it in place; MobileNet, for example, might take many days to train.
We have only been able to run a full test (with the --ce flag) of Mnist and ResNet (Flowers).
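A command of roughly this shape should reproduce such a run, assuming the script accepts the --ce flag alongside the options shown above (illustrative, not the exact invocation; add --use_gpu=False where the script requires it):
FLAGS_use_mkldnn=true python train.py --ce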
Mnist training
Name | Result |
---|---|
Reference | Loss at epoch 4, Test avg_loss is: 0.0372143830631, acc is: 0.989182692308 |
OneDNN | Loss at epoch 4, Test avg_loss is: 0.0370462594453, acc is: 0.98828125 |
ResNet (Flowers) training
Name | Result |
---|---|
Reference | final eval acc1 0.689 acc5 0.886 |
OneDNN | final eval acc1 0.746 acc5 0.927 |
Final note
Since we are only just starting to support OneDNN training in PaddlePaddle, there may still be some training bugs that impact accuracy.