Verification of dygraph MKLDNN accuracy convergence
Created by: sfraczek
Quick Note: OneDNN was previously named DNNL and MKLDNN.
Instructions for running Dygraph OneDNN training
Once you merge the required pull requests, you can run training of a few dygraph models with OneDNN kernels. Some modifications to the models are still required. The models whose training we are starting to support now are Mnist, ResNet, MobileNetV1 and MobileNetV2.
You can prepend DNNL_VERBOSE=1 to the command to see which primitives are created by the OneDNN library and so verify which ops are using OneDNN primitives.
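For example, combined with the Mnist command given below, this looks as follows:
DNNL_VERBOSE=1 FLAGS_use_mkldnn=true python train.py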
All models info
Some training scripts, e.g. ResNet, always use the GPU, so you have to add a switch such as --use_gpu to the script and then disable it on the command line to run on the CPU.
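For scripts that lack such a switch, a minimal sketch of adding one could look like the code below. This is only an illustration under the assumption that the script uses argparse and fluid dygraph; the str2bool helper and the defaults are not taken from the actual model scripts.

# Hypothetical sketch: add a --use_gpu switch so the CPU can be selected
# from the command line. Names and defaults are illustrative only.
import argparse
import paddle.fluid as fluid

def str2bool(v):
    # argparse would treat the string "False" as truthy, so convert explicitly
    return v.lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--use_gpu", type=str2bool, default=True,
                    help="Train on the GPU if True, otherwise on the CPU")
args = parser.parse_args()

# Pick the execution place from the switch and run dygraph under it
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
with fluid.dygraph.guard(place):
    pass  # model construction and the training loop go here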
Mnist
FLAGS_use_mkldnn=true python train.py
Mobilenet
FLAGS_use_mkldnn=true python train.py --use_gpu=False --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --lr_strategy=cosine_decay --lr=0.1 --num_epochs=240 --data_dir=/data/ILSVRC2012 --l2_decay=4e-5 --model=MobileNetV2
(use --model=MobileNetV1 for V1)
You also have to add use_mkldnn=True to ops that are not imported from dygraph:
diff --git a/dygraph/mobilenet/mobilenet_v2.py b/dygraph/mobilenet/mobilenet_v2.py
index 6da031f..8541c87 100644
--- a/dygraph/mobilenet/mobilenet_v2.py
+++ b/dygraph/mobilenet/mobilenet_v2.py
@@ -66,7 +66,7 @@ class ConvBNLayer(fluid.dygraph.Layer):
y = self._conv(inputs)
y = self._batch_norm(y)
if if_act:
- y = fluid.layers.relu6(y)
+ y = fluid.layers.relu6(y, use_mkldnn=True)
return y
@@ -112,7 +112,7 @@ class InvertedResidualUnit(fluid.dygraph.Layer):
y = self._bottleneck_conv(y, if_act=True)
y = self._linear_conv(y, if_act=False)
if ifshortcut:
- y = fluid.layers.elementwise_add(inputs, y)
+ y = fluid.layers.elementwise_add(inputs, y, use_mkldnn=True)
return y
ResNet
FLAGS_use_mkldnn=true python train.py
You also have to add use_mkldnn=True to ops that are not imported from dygraph:
diff --git a/dygraph/resnet/train.py b/dygraph/resnet/train.py
index 6bf86f9..f53c5a2 100644
--- a/dygraph/resnet/train.py
+++ b/dygraph/resnet/train.py
@@ -239,9 +239,9 @@ class BottleneckBlock(fluid.dygraph.Layer):
else:
short = self.short(inputs)
- y = fluid.layers.elementwise_add(x=short, y=conv2)
+ y = fluid.layers.elementwise_add(x=short, y=conv2, use_mkldnn=True)
- layer_helper = LayerHelper(self.full_name(), act='relu')
+ layer_helper = LayerHelper(self.full_name(), act='relu', use_mkldnn=True)
return layer_helper.append_activation(y)
Which PRs have to be merged to run Dygraph OneDNN training
Required
- add use_mkldnn attribute to ops in dygraph https://github.com/PaddlePaddle/Paddle/pull/25773
- don't clear mkldnn cache in block_op executor dtor https://github.com/PaddlePaddle/Paddle/pull/25735
- Enable mkldnn layout conversion https://github.com/PaddlePaddle/Paddle/pull/25778
- Added Relu6 mkldnn op https://github.com/PaddlePaddle/Paddle/pull/25713
Related PRs
- support mnist and resnet dygraph_to_static test https://github.com/PaddlePaddle/Paddle/pull/25774
- enable check_dygraph for mkldnn activation tests https://github.com/PaddlePaddle/Paddle/pull/25779
Request for verifying training accuracy convergence
Could you please verify that training accuracy converges properly? With our limited resources this is hard for us to do, and we do not have a procedure for it in place; MobileNet, for example, might take many days to train.
We have only been able to run a full test (with the --ce flag) of Mnist and ResNet (Flowers).
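A command of roughly this shape should reproduce such a run, assuming the script accepts the --ce flag alongside the options shown above (illustrative, not the exact invocation; add --use_gpu=False where the script requires it):
FLAGS_use_mkldnn=true python train.py --ce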
Mnist training
Name | Result |
---|---|
Reference | Loss at epoch 4, Test avg_loss is: 0.0372143830631, acc is: 0.989182692308 |
OneDNN | Loss at epoch 4, Test avg_loss is: 0.0370462594453, acc is: 0.98828125 |
ResNet (Flowers) training
Name | Result |
---|---|
Reference | final eval acc1 0.689 acc5 0.886 |
OneDNN | final eval acc1 0.746 acc5 0.927 |
Final note
Since we are only just starting to support OneDNN training in PaddlePaddle, there may still be some training bugs that impact accuracy.