SE-ResNeXt Optimization
Created by: jacquesqiao
Background
Project: https://github.com/PaddlePaddle/Paddle/projects/55
Profiling script:
Optimization methods and result
- Delete unused GPU memory during training.
  - This slows training down slightly, because the GPU is an asynchronous device. However, it reduces GPU memory usage more than variable reuse does (a 54.3% reduction vs. 45.5%). https://github.com/PaddlePaddle/Paddle/pull/8690
- Remove program.clone in Executor. (25% speedup) https://github.com/PaddlePaddle/Paddle/issues/8729
- Initialize NCCL once instead of on every run. (5%~6% speedup) https://github.com/PaddlePaddle/Paddle/issues/8758
- Use constant folding at compile time to reduce the number of elementwise_mul calls made during the optimization step. (5%~10% speedup) https://github.com/PaddlePaddle/Paddle/issues/8873
- Optimize elementwise-related ops: use our own implementations instead of depending on Eigen. (10x speedup for a single op) https://github.com/PaddlePaddle/Paddle/issues/8811
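The early-deletion idea can be sketched as a simple liveness pass: scan the op list once to find each variable's last reading op, then free its buffer immediately after that op runs. The op tuples and `run_with_early_free` helper below are illustrative stand-ins, not Paddle's actual executor internals:

```python
def last_use_map(ops):
    """Map each input variable name to the index of the op that reads it last.
    ops: list of (op_name, input_names, output_names) tuples (a toy IR)."""
    last = {}
    for i, (_name, inputs, _outputs) in enumerate(ops):
        for v in inputs:
            last[v] = i
    return last

def run_with_early_free(ops, memory):
    """memory: dict var_name -> buffer. Frees each variable right after
    its last reader finishes, instead of keeping it until the end."""
    last = last_use_map(ops)
    for i, (_name, inputs, outputs) in enumerate(ops):
        for v in outputs:
            memory.setdefault(v, object())  # pretend to allocate the output
        for v in inputs:
            if last.get(v) == i:
                memory.pop(v, None)         # last reader is done: free now
    return memory
```

Running this on a two-op chain frees the intermediates as soon as they are dead, which is why peak memory drops more than with plain variable reuse.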
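The program.clone removal amounts to preparing a program once and reusing the prepared state across runs, instead of deep-copying the whole program on every call. A toy sketch of that caching idea (the `Executor` class here is hypothetical, not Paddle's real API):

```python
class Executor:
    """Toy executor showing why dropping a per-run program copy helps.
    An illustration only; Paddle's real Executor differs."""

    def __init__(self):
        self._cache = {}  # id(program) -> prepared run context

    def run(self, program):
        ctx = self._cache.get(id(program))
        if ctx is None:
            # Prepare once. The removed code effectively did the equivalent
            # of copy.deepcopy(program) here on *every* run, which is slow
            # for large programs.
            ctx = {"ops": list(program["ops"])}
            self._cache[id(program)] = ctx
        return [op() for op in ctx["ops"]]
```

Repeated `run` calls on the same program hit the cache, so the preparation cost is paid once rather than per iteration.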
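Initializing NCCL once is the classic lazily-initialized, process-wide singleton. In the sketch below, `init_nccl` is a placeholder for the real (expensive) communicator setup, not an actual NCCL binding:

```python
import threading

_nccl_lock = threading.Lock()
_nccl_comm = None

def init_nccl(device_ids):
    """Placeholder for the expensive real initialization
    (ncclCommInitAll in the C library); here it just records devices."""
    return {"devices": tuple(device_ids)}

def get_nccl_communicator(device_ids):
    """Return a process-wide communicator, created on first call only."""
    global _nccl_comm
    if _nccl_comm is None:
        with _nccl_lock:
            if _nccl_comm is None:  # double-checked locking
                _nccl_comm = init_nccl(device_ids)
    return _nccl_comm
```

Every subsequent call returns the same object, so the per-iteration initialization cost disappears.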
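Constant folding can be illustrated on a miniature op list: any elementwise_mul whose inputs are all compile-time constants is evaluated once during compilation, so it never executes at training time. The op-tuple representation below is a simplification of Paddle's program description, not its real IR:

```python
def fold_constants(ops, constants):
    """One pass of constant folding over a tiny op list.
    ops: list of (op_type, input_names, output_name).
    constants: dict of names whose values are known at compile time."""
    folded = []
    for op_type, inputs, out in ops:
        if op_type == "elementwise_mul" and all(i in constants for i in inputs):
            # Evaluate now and record the result as a new constant;
            # the op is dropped from the runtime program.
            constants[out] = constants[inputs[0]] * constants[inputs[1]]
        else:
            folded.append((op_type, inputs, out))
    return folded, constants
```

A typical beneficiary is learning-rate scaling: `lr * decay` becomes a single precomputed constant instead of an op that runs every step.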
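The elementwise rewrite targets broadcast patterns such as adding a `(C,)`-shaped bias to an `(N, C)` tensor. Below is a minimal Python model of that broadcast rule only; plain lists stand in for GPU tensors, and the actual speedup comes from a hand-written CUDA kernel that this sketch does not attempt to show:

```python
def elementwise_add(x, y):
    """Elementwise add where y either matches x's shape or broadcasts
    over x's leading dimension, mirroring the (N, C) + (C,) pattern."""
    if y and isinstance(y[0], list):
        # Same shape: add position-wise.
        return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
    # Broadcast: reuse the same y row for every row of x.
    return [[a + b for a, b in zip(row, y)] for row in x]
```

A dedicated kernel for this pattern avoids the generic expression machinery, which is where the reported per-op speedup comes from.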
Status
- Multi-card training has not been fully tested.
- The acceleration ratio for multi-card training still needs to be profiled.
Plan
Give a complete profile after all the optimizations are merged (@chengduoZH).