Created by: qingqing01
Fix https://github.com/PaddlePaddle/Paddle/issues/6283
We originally added this CUDA stream synchronization while the operators were being developed, so that a CUDA error could be detected right after the kernel that caused it. Once the framework is stable, this synchronization should be removed to speed up training.
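
To illustrate the trade-off, here is a minimal host-only sketch in plain C++. It is not Paddle code and does not call the CUDA runtime: `FakeStream`, `Op`, and `RunOps` are hypothetical names, and the queue only mimics the fact that asynchronously launched work reports its failure at the next synchronization point. Under that assumption, synchronizing after every op attributes a failure to the op that caused it, while synchronizing once at the end only tells us that something failed.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a CUDA stream: work is queued and only runs
// (and only reports failures) when the stream is synchronized.
class FakeStream {
 public:
  void Enqueue(std::function<bool()> kernel) {
    pending_.push_back(std::move(kernel));
  }

  // Runs all queued work; returns false if any queued kernel failed.
  bool Synchronize() {
    bool ok = true;
    for (auto& kernel : pending_) ok = kernel() && ok;
    pending_.clear();
    return ok;
  }

 private:
  std::vector<std::function<bool()>> pending_;
};

// Hypothetical operator: a name plus a "kernel" that returns false on error.
struct Op {
  std::string name;
  std::function<bool()> kernel;
};

// Returns the name of the op a failure is attributed to, or "" on success.
// sync_each_op = true models the development-time behavior (easy to debug,
// but the host waits after every op); false models the faster path.
std::string RunOps(const std::vector<Op>& ops, FakeStream& stream,
                   bool sync_each_op) {
  for (const auto& op : ops) {
    stream.Enqueue(op.kernel);
    if (sync_each_op && !stream.Synchronize()) return op.name;
  }
  if (!stream.Synchronize()) return "<unknown: detected at final sync>";
  return "";
}
```

With per-op synchronization a failing op is named directly; without it, the failure is only seen at the final synchronization and can no longer be tied to a specific op. That is why the extra synchronization was useful during development and is pure overhead afterwards.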