Created by: panyx0718
ncclinit takes a long time and blocks the GPU kernels while it runs.
This change caches the ncclComm and reinitializes it only when num_gpus changes.
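
A minimal sketch of the caching idea, not the code in this PR: `CommGroup`, `InitComms`, `CommCache`, and `g_init_calls` are hypothetical names used only for illustration, and the expensive NCCL setup is replaced by a stub so the example runs without a GPU.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for a set of ncclComm_t handles, one per GPU.
struct CommGroup {
  int num_gpus;
  std::vector<int> handles;
};

// Counts how many times the (expensive) initialization ran.
static int g_init_calls = 0;

// Stub for the costly step. Real code would create the communicators
// here, e.g. with ncclCommInitAll (assumption: not taken from this PR).
std::unique_ptr<CommGroup> InitComms(int num_gpus) {
  ++g_init_calls;
  std::unique_ptr<CommGroup> group(new CommGroup());
  group->num_gpus = num_gpus;
  group->handles.assign(num_gpus, 0);
  return group;
}

// Keeps the last communicator group and rebuilds it only when the
// requested GPU count differs from the cached one.
class CommCache {
 public:
  CommGroup* Get(int num_gpus) {
    if (!cached_ || cached_->num_gpus != num_gpus) {
      // Real code would also release the old communicators first
      // (e.g. ncclCommDestroy) before replacing them.
      cached_ = InitComms(num_gpus);
    }
    return cached_.get();
  }

 private:
  std::unique_ptr<CommGroup> cached_;
};
```

With this shape, repeated calls to `Get` with the same `num_gpus` reuse the cached group, so the initialization cost is paid once per GPU count rather than on every run.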