SE-ResNeXt Optimization
Created by: jacquesqiao
Background
Project: https://github.com/PaddlePaddle/Paddle/projects/55
Profiling script:
Optimization methods and result
- Delete unused GPU memory during training.
  - This slows training down slightly, because the GPU is an asynchronous device. However, it reduces GPU memory usage more than variable reuse does (a 54.3% reduction vs. 45.5%). https://github.com/PaddlePaddle/Paddle/pull/8690
- Remove program.clone in Executor. (25% speedup) https://github.com/PaddlePaddle/Paddle/issues/8729
- Initialize NCCL once instead of on every run. (5%~6% speedup) https://github.com/PaddlePaddle/Paddle/issues/8758
- Use constant folding at compile time to reduce the number of elementwise_mul calls made during the optimization step. (5%~10% speedup) https://github.com/PaddlePaddle/Paddle/issues/8873
- Optimize elementwise-related ops: use our own implementations instead of depending on Eigen. (10x speedup for a single op) https://github.com/PaddlePaddle/Paddle/issues/8811
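The early-deletion idea can be sketched as a simple liveness pass: scan the op list once to find each variable's last reading op, then free its buffer immediately after that op runs. The op tuples and `run_with_early_free` helper below are illustrative stand-ins, not Paddle's actual executor internals:

```python
def last_use_map(ops):
    """Map each input variable name to the index of the op that reads it last.
    ops: list of (op_name, input_names, output_names) tuples (a toy IR)."""
    last = {}
    for i, (_name, inputs, _outputs) in enumerate(ops):
        for v in inputs:
            last[v] = i
    return last

def run_with_early_free(ops, memory):
    """memory: dict var_name -> buffer. Frees each variable right after
    its last reader finishes, instead of keeping it until the end."""
    last = last_use_map(ops)
    for i, (_name, inputs, outputs) in enumerate(ops):
        for v in outputs:
            memory.setdefault(v, object())  # pretend to allocate the output
        for v in inputs:
            if last.get(v) == i:
                memory.pop(v, None)         # last reader is done: free now
    return memory
```

Running this on a two-op chain frees the intermediates as soon as they are dead, which is why peak memory drops more than with plain variable reuse.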
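The program.clone removal amounts to preparing a program once and reusing the prepared state across runs, instead of deep-copying the whole program on every call. A toy sketch of that caching idea (the `Executor` class here is hypothetical, not Paddle's real API):

```python
class Executor:
    """Toy executor showing why dropping a per-run program copy helps.
    An illustration only; Paddle's real Executor differs."""

    def __init__(self):
        self._cache = {}  # id(program) -> prepared run context

    def run(self, program):
        ctx = self._cache.get(id(program))
        if ctx is None:
            # Prepare once. The removed code effectively did the equivalent
            # of copy.deepcopy(program) here on *every* run, which is slow
            # for large programs.
            ctx = {"ops": list(program["ops"])}
            self._cache[id(program)] = ctx
        return [op() for op in ctx["ops"]]
```

Repeated `run` calls on the same program hit the cache, so the preparation cost is paid once rather than per iteration.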
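Initializing NCCL once is the classic lazily-initialized, process-wide singleton. In the sketch below, `init_nccl` is a placeholder for the real (expensive) communicator setup, not an actual NCCL binding:

```python
import threading

_nccl_lock = threading.Lock()
_nccl_comm = None

def init_nccl(device_ids):
    """Placeholder for the expensive real initialization
    (ncclCommInitAll in the C library); here it just records devices."""
    return {"devices": tuple(device_ids)}

def get_nccl_communicator(device_ids):
    """Return a process-wide communicator, created on first call only."""
    global _nccl_comm
    if _nccl_comm is None:
        with _nccl_lock:
            if _nccl_comm is None:  # double-checked locking
                _nccl_comm = init_nccl(device_ids)
    return _nccl_comm
```

Every subsequent call returns the same object, so the per-iteration initialization cost disappears.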
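Constant folding can be illustrated on a miniature op list: any elementwise_mul whose inputs are all compile-time constants is evaluated once during compilation, so it never executes at training time. The op-tuple representation below is a simplification of Paddle's program description, not its real IR:

```python
def fold_constants(ops, constants):
    """One pass of constant folding over a tiny op list.
    ops: list of (op_type, input_names, output_name).
    constants: dict of names whose values are known at compile time."""
    folded = []
    for op_type, inputs, out in ops:
        if op_type == "elementwise_mul" and all(i in constants for i in inputs):
            # Evaluate now and record the result as a new constant;
            # the op is dropped from the runtime program.
            constants[out] = constants[inputs[0]] * constants[inputs[1]]
        else:
            folded.append((op_type, inputs, out))
    return folded, constants
```

A typical beneficiary is learning-rate scaling: `lr * decay` becomes a single precomputed constant instead of an op that runs every step.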
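The elementwise rewrite targets broadcast patterns such as adding a `(C,)`-shaped bias to an `(N, C)` tensor. Below is a minimal Python model of that broadcast rule only; plain lists stand in for GPU tensors, and the actual speedup comes from a hand-written CUDA kernel that this sketch does not attempt to show:

```python
def elementwise_add(x, y):
    """Elementwise add where y either matches x's shape or broadcasts
    over x's leading dimension, mirroring the (N, C) + (C,) pattern."""
    if y and isinstance(y[0], list):
        # Same shape: add position-wise.
        return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]
    # Broadcast: reuse the same y row for every row of x.
    return [[a + b for a, b in zip(row, y)] for row in x]
```

A dedicated kernel for this pattern avoids the generic expression machinery, which is where the reported per-op speedup comes from.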
Status
- Multi-card training has not been fully tested.
- The acceleration ratio for multi-card training still needs to be profiled.
Plan
Give a complete profile after all the optimizations are merged (@chengduoZH).