Created by: panyx0718
On TitanX with 4 devices and without NCCL, the SE-ResNeXt step time drops from ~1.18 to ~1.10 (roughly a 7% reduction).