remove NCCL-related code on Windows
* support running data_norm_op on CUDA
* add two parameters: sync_stats and summary_decay_rate
* add unit tests (UT)
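
For reviewers, a rough sketch of what the two new parameters are meant to control. This is plain Python for illustration only, not the CUDA kernel in this change. The summary field names (batch_size, batch_sum, batch_square_sum), the update rule, and the helper names are assumptions made for this example; the cross-device sum stands in for an NCCL all-reduce.

```python
import math


def all_reduce_sum(per_device_stats):
    """Stand-in for sync_stats: sum each statistic over all devices.

    Hypothetical helper; a real multi-GPU run would use an all-reduce.
    """
    keys = per_device_stats[0].keys()
    return {k: sum(stats[k] for stats in per_device_stats) for k in keys}


def update_summary(summary, batch_stats, summary_decay_rate):
    """Decay the running summary, then add the current batch statistics.

    Assumed update rule: new = old * summary_decay_rate + batch.
    """
    return {k: summary[k] * summary_decay_rate + batch_stats[k] for k in summary}


def normalize(x, summary):
    """Normalize one value using the accumulated summary (assumed formula)."""
    mean = summary["batch_sum"] / summary["batch_size"]
    scale = math.sqrt(summary["batch_size"] / summary["batch_square_sum"])
    return (x - mean) * scale


# Two devices each contribute their local batch statistics.
device_stats = [
    {"batch_size": 1.0, "batch_sum": 2.0, "batch_square_sum": 0.5},
    {"batch_size": 1.0, "batch_sum": 4.0, "batch_square_sum": 1.5},
]
synced = all_reduce_sum(device_stats)

# Fold the synced statistics into the running summary with a decay factor.
summary = {"batch_size": 4.0, "batch_sum": 8.0, "batch_square_sum": 4.0}
summary = update_summary(summary, synced, summary_decay_rate=0.5)
```

A decay rate close to 1 keeps a long history of statistics, while a smaller value weights recent batches more heavily. The 0.5 above is chosen only to keep the arithmetic easy to follow.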