Remove the CUDA stream synchronization after each operator.
Created by: qingqing01
We originally added this CUDA stream synchronization during operator development so that an error in any CUDA kernel would be detected at the operator that caused it. Once the framework is stable, this synchronization should be removed to speed up training.
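The trade-off can be illustrated with a minimal standalone sketch. It is not the framework's actual code: `SYNC_AFTER_EACH_OP`, `CHECK_CUDA`, `ScaleKernel`, and `RunOp` are hypothetical names introduced only for this example. It assumes the per-operator synchronization is a `cudaStreamSynchronize` call issued after the kernel launch, and shows how it could be kept behind a compile-time switch for debugging while the default path synchronizes only once at the end.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical compile-time switch (not a real framework flag).
// Build with -DSYNC_AFTER_EACH_OP=1 to restore per-operator synchronization.
#ifndef SYNC_AFTER_EACH_OP
#define SYNC_AFTER_EACH_OP 0
#endif

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,        \
              __LINE__, cudaGetErrorString(err_));                  \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)

// Toy kernel standing in for an operator's CUDA kernel.
__global__ void ScaleKernel(float* x, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

// Hypothetical stand-in for running one operator on a stream.
void RunOp(float* x, int n, cudaStream_t stream) {
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  ScaleKernel<<<blocks, threads, 0, stream>>>(x, 2.0f, n);
  // Reports launch errors (e.g. a bad launch configuration) without blocking.
  CHECK_CUDA(cudaGetLastError());
#if SYNC_AFTER_EACH_OP
  // Debug path: block the host until this kernel finishes, so an
  // execution error is reported at the operator that caused it.
  CHECK_CUDA(cudaStreamSynchronize(stream));
#endif
}

int main() {
  const int n = 1 << 20;
  float* x = nullptr;
  cudaStream_t stream;
  CHECK_CUDA(cudaStreamCreate(&stream));
  CHECK_CUDA(cudaMalloc(&x, n * sizeof(float)));
  CHECK_CUDA(cudaMemsetAsync(x, 0, n * sizeof(float), stream));

  // Kernels on the same stream still execute in order without host-side syncs.
  for (int op = 0; op < 100; ++op) RunOp(x, n, stream);

  // Single synchronization at the end; asynchronous errors surface here.
  CHECK_CUDA(cudaStreamSynchronize(stream));

  CHECK_CUDA(cudaFree(x));
  CHECK_CUDA(cudaStreamDestroy(stream));
  return 0;
}
```

With the switch off, the host can enqueue the next operator while the previous kernel is still running, which is where the speedup would come from. The cost is that an asynchronous kernel failure is reported at a later CUDA call rather than at the operator that triggered it.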