Created by: mapingshuo
Problem
When training the bert-large model on 4 V100 cards, I found that the softmax operator runs very slowly. After setting `use_cudnn=True` in the softmax API, training speeds up by about 20%. We should change the default value of `use_cudnn` to `True`, since the cuDNN kernel is up to 10 times faster than the non-cuDNN one.
| | speed |
|---|---|
| without cudnn | 11.1 steps/s |
| with cudnn | 13.1 steps/s |
Profile without cudnn:

```
Event         Calls  Total    CPU Time (Ratio)     GPU Time (Ratio)      Min.     Max.     Ave.     Ratio.
softmax       24     41.3891  2.063354 (0.049853)  39.325744 (0.950147)  1.702    1.81038  1.72455  0.0165059
softmax_grad  24     93.6806  5.729520 (0.061160)  87.951125 (0.938840)  3.42631  6.93811  3.90336  0.0373596
```
Profile with cudnn:

```
Event         Calls  Total    CPU Time (Ratio)     GPU Time (Ratio)      Min.      Max.      Ave.      Ratio.
softmax       24     9.38122  0.746789 (0.079605)  8.634429 (0.920395)   0.360629  0.4713    0.390884  0.00478599
softmax_grad  24     9.14636  1.081652 (0.118260)  8.064708 (0.881740)   0.330742  0.640794  0.381098  0.00466618
```
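As a sanity check on the numbers above, the per-operator and end-to-end speedups can be computed directly from the two profiles:

```python
# Speedups derived from the profile numbers reported above.

# End-to-end training throughput (steps/s).
without_cudnn = 11.1
with_cudnn = 13.1
step_speedup = with_cudnn / without_cudnn  # ~1.18, i.e. ~18% faster overall

# Per-operator total times (ms over 24 calls) from the profiles.
fwd_speedup = 41.3891 / 9.38122  # softmax forward: ~4.4x
bwd_speedup = 93.6806 / 9.14636  # softmax_grad: ~10.2x

print(f"step speedup:            {step_speedup:.2f}x")
print(f"softmax forward speedup: {fwd_speedup:.1f}x")
print(f"softmax_grad speedup:    {bwd_speedup:.1f}x")
```

So the "10 times" figure refers to the backward softmax operator itself; the whole training step improves by roughly 18%.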
Reproduce
git clone https://github.com/PaddlePaddle/Fleet.git
cd Fleet/benchmark/collective/bert
sh train.sh
The way to turn on `use_cudnn=True` in model/transformer_encoder.py is as follows:

```python
weights = layers.softmax(product, use_cudnn=True)
```
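Both kernels compute the same function, so switching `use_cudnn` should only affect speed, not results. For reference, a minimal NumPy sketch (not the Paddle kernel) of the numerically stable row-wise softmax the operator implements:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability;
    # this does not change the result of the softmax.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

# Attention scores shaped like [batch, heads, seq, seq] in the encoder
# (illustrative shape, not taken from the benchmark config).
scores = np.random.randn(2, 4, 8, 8).astype("float32")
weights = softmax(scores)
assert np.allclose(weights.sum(axis=-1), 1.0, atol=1e-5)
```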