Speed up BERT training performance
Created by: gongweibao
Speed up methods:
- Mixed precision.
- Use SelectedRows allreduce.
- Better multithreading instead of the scheduler.
  - Use multiprocess instead.
- Resolve operator bottlenecks, such as dropout.
- Fuse small but numerous operators.
- Run collective op handles so that they overlap with other operators.
  - Is fusing useful here?
- Deep Gradient Compression
  - No speedup on a 100Gb+IB network.
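
The Deep Gradient Compression item can be illustrated with a minimal sketch of its core step: send only the largest-magnitude gradient entries and keep the rest in a local residual for later steps. This is an assumption-laden illustration in plain Python, not the implementation used for the experiments above. The function name `sparsify_top_k` and the `ratio` parameter are hypothetical, and the momentum correction, gradient clipping, and warm-up of the full method are omitted.

```python
def sparsify_top_k(grad, residual, ratio=0.01):
    """Pick the top-k entries to send; return (indices, values, new_residual).

    grad and residual are flat lists of floats of equal length.
    Hypothetical helper for illustration only.
    """
    # Add the new gradient to whatever was left unsent in earlier steps.
    acc = [r + g for r, g in zip(residual, grad)]
    k = max(1, int(len(acc) * ratio))
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    top.sort()
    values = [acc[i] for i in top]
    # Entries that are sent get cleared; the others stay in the residual.
    for i in top:
        acc[i] = 0.0
    return top, values, acc


# Two training steps on a 4-element gradient, sending 1 entry per step.
residual = [0.0, 0.0, 0.0, 0.0]
idx1, val1, residual = sparsify_top_k([0.5, -2.0, 0.25, 0.125], residual, ratio=0.25)
idx2, val2, residual = sparsify_top_k([0.5, 0.0, 0.25, 0.0], residual, ratio=0.25)
```

Only the `(indices, values)` pairs would go through the allreduce, so the benefit depends on communication being the bottleneck. On a fast interconnect that is less likely to be the case, which would be consistent with the note above.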