Forked from PaddlePaddle / Paddle
remove NCCL-related code on Windows
* support running data_norm_op on CUDA
* add two parameters: sync_stats and summary_decay_rate
* add unit tests (UT)