Created by: mapingshuo
Problem
When training the bert-large model on 4 V100 cards, I found that the softmax operator runs very slowly. After setting `use_cudnn=True` in the softmax API, training speeds up by about 20%. We should change the default value of `use_cudnn` to `True`, since the cuDNN kernel is up to 10 times faster than the non-cuDNN one.
| | speed |
|---|---|
| without cudnn | 11.1 steps/s |
| with cudnn | 13.1 steps/s |
Profile without cudnn:

```
Event         Calls  Total    CPU Time (Ratio)     GPU Time (Ratio)      Min.     Max.     Ave.     Ratio.
softmax       24     41.3891  2.063354 (0.049853)  39.325744 (0.950147)  1.702    1.81038  1.72455  0.0165059
softmax_grad  24     93.6806  5.729520 (0.061160)  87.951125 (0.938840)  3.42631  6.93811  3.90336  0.0373596
```
Profile with cudnn:

```
Event         Calls  Total    CPU Time (Ratio)     GPU Time (Ratio)      Min.      Max.      Ave.      Ratio.
softmax       24     9.38122  0.746789 (0.079605)  8.634429 (0.920395)   0.360629  0.4713    0.390884  0.00478599
softmax_grad  24     9.14636  1.081652 (0.118260)  8.064708 (0.881740)   0.330742  0.640794  0.381098  0.00466618
```
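As a sanity check on the numbers above, the per-operator and end-to-end speedups can be computed directly from the two profiles:

```python
# Speedups derived from the profile numbers reported above.

# End-to-end training throughput (steps/s).
without_cudnn = 11.1
with_cudnn = 13.1
step_speedup = with_cudnn / without_cudnn  # ~1.18, i.e. ~18% faster overall

# Per-operator total times (ms over 24 calls) from the profiles.
fwd_speedup = 41.3891 / 9.38122  # softmax forward: ~4.4x
bwd_speedup = 93.6806 / 9.14636  # softmax_grad: ~10.2x

print(f"step speedup:            {step_speedup:.2f}x")
print(f"softmax forward speedup: {fwd_speedup:.1f}x")
print(f"softmax_grad speedup:    {bwd_speedup:.1f}x")
```

So the "10 times" figure refers to the backward softmax operator itself; the whole training step improves by roughly 18%.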
Reproduce
git clone https://github.com/PaddlePaddle/Fleet.git
cd Fleet/benchmark/collective/bert
sh train.sh
The way to turn on `use_cudnn=True` in model/transformer_encoder.py is as follows:

```python
weights = layers.softmax(product, use_cudnn=True)
```
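Both kernels compute the same function, so switching `use_cudnn` should only affect speed, not results. For reference, a minimal NumPy sketch (not the Paddle kernel) of the numerically stable row-wise softmax the operator implements:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability;
    # this does not change the result of the softmax.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

# Attention scores shaped like [batch, heads, seq, seq] in the encoder
# (illustrative shape, not taken from the benchmark config).
scores = np.random.randn(2, 4, 8, 8).astype("float32")
weights = softmax(scores)
assert np.allclose(weights.sum(axis=-1), 1.0, atol=1e-5)
```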