Created by: ktlichkid
The GPU grad kernel of the sequence expand op is not robust when the memory optimizer is on.
The kernel accumulated the gradient sum directly into the d_x tensor without checking its initial value, so any stale data already in d_x ended up in the result.
In this PR, I moved the "set zero" call outside the functor to guarantee that d_x is set to zero on both CPU and GPU.
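
To illustrate the failure mode, here is a minimal CPU-only sketch in plain C++. It is not the actual kernel code: `AccumulateGrad`, `SequenceExpandGrad`, and the `repeats` vector are hypothetical names, and each row is reduced to a single float for brevity. It assumes the grad functor sums the gradients of the expanded rows back into the row they were copied from, and that d_x may hold stale values when its buffer is reused.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the grad functor: adds the gradient of every
// expanded row into the source row it was copied from.
// repeats[i] is how many times row i of x appears in the expanded output.
// Note that it only accumulates; it never resets d_x.
void AccumulateGrad(const std::vector<float>& d_out,
                    const std::vector<std::size_t>& repeats,
                    std::vector<float>* d_x) {
  assert(d_x->size() == repeats.size());
  std::size_t out_idx = 0;
  for (std::size_t i = 0; i < repeats.size(); ++i) {
    for (std::size_t r = 0; r < repeats[i]; ++r) {
      (*d_x)[i] += d_out[out_idx++];
    }
  }
}

// Hypothetical wrapper mirroring the idea of this PR: zero d_x outside the
// functor, so the result does not depend on what the buffer held before.
void SequenceExpandGrad(const std::vector<float>& d_out,
                        const std::vector<std::size_t>& repeats,
                        std::vector<float>* d_x) {
  std::fill(d_x->begin(), d_x->end(), 0.0f);  // the "set zero" step
  AccumulateGrad(d_out, repeats, d_x);
}
```

With `d_out = {1, 2, 3, 4, 5}` and `repeats = {2, 3}`, calling `AccumulateGrad` alone on a d_x that already holds `{100, -7}` yields `{103, 5}`, while `SequenceExpandGrad` yields `{3, 12}` regardless of the initial contents.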