Forked from PaddlePaddle / Paddle
* Support running data_norm_op on CUDA
* Add two parameters: sync_stats and summary_decay_rate
* Add unit tests (UT)
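
For readers unfamiliar with the idea, here is a minimal pure-Python sketch of what a decay rate on running summary statistics can look like. It is an illustration under stated assumptions, not the operator's actual implementation: the function names `normalize` and `update_summaries`, the three summary lists, and the exact update formula are hypothetical, and the cross-device aggregation that `sync_stats` suggests is not modeled at all.

```python
import math


def normalize(rows, batch_size, batch_sum, batch_square_sum):
    """Normalize each column with running summaries: (x - mean) * scale."""
    means = [s / n for s, n in zip(batch_sum, batch_size)]
    scales = [math.sqrt(n / sq) for n, sq in zip(batch_size, batch_square_sum)]
    return [[(v - m) * s for v, m, s in zip(row, means, scales)] for row in rows]


def update_summaries(rows, batch_size, batch_sum, batch_square_sum, decay_rate):
    """Decay the old summaries, then add this batch's contribution.

    Assumption: the square-sum term accumulates squared deviations from the
    current running mean. This is a hypothetical choice for the sketch.
    """
    means = [s / n for s, n in zip(batch_sum, batch_size)]
    new_size, new_sum, new_square_sum = [], [], []
    for j, mean in enumerate(means):
        column = [row[j] for row in rows]
        new_size.append(batch_size[j] * decay_rate + len(column))
        new_sum.append(batch_sum[j] * decay_rate + sum(column))
        new_square_sum.append(
            batch_square_sum[j] * decay_rate + sum((v - mean) ** 2 for v in column)
        )
    return new_size, new_sum, new_square_sum


# One feature column with running mean 2.0 and scale 1.0.
size, total, square = [2.0], [4.0], [2.0]
normalized = normalize([[3.0]], size, total, square)
updated = update_summaries([[3.0], [1.0]], size, total, square, decay_rate=0.5)
```

A decay rate close to 1 keeps a long memory of past batches, while a smaller value lets recent batches dominate the statistics; the 0.5 used here is only chosen to make the arithmetic easy to follow.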