Created by: zhiqiu
The original CUDA kernel of the softmax_with_cross_entropy op may produce wrong results when (1) soft_label = False, (2) numeric_stable_mode = True, and (3) axis_dim is large (e.g., axis_dim > 1000).
This PR fixes that problem by adding the necessary synchronization in RowReductionForDiffMaxSum.
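
For reference, here is a minimal CPU sketch of the per-row result the op is expected to produce in the hard-label, numerically stable case. It is not the CUDA kernel from this PR. The function name `stable_softmax_cross_entropy` and the test row are hypothetical, chosen only for illustration. The sum step needs the finished row maximum, which is presumably why a parallel version needs a barrier between the two reductions.

```python
import math


def stable_softmax_cross_entropy(logits, label):
    """Hard-label cross entropy for one row, computed in a numerically stable way.

    Hypothetical reference helper, not part of the op's actual API.
    """
    row_max = max(logits)                      # max reduction over the axis
    diffs = [x - row_max for x in logits]      # shift so the largest entry is 0
    sum_exp = sum(math.exp(d) for d in diffs)  # sum reduction, needs row_max first
    # loss = -log(softmax(logits)[label]) = log(sum_exp) - diffs[label]
    return math.log(sum_exp) - diffs[label]


# A row with axis_dim > 1000 and large magnitudes, where a naive exp(x) overflows.
row = [1000.0 + (i % 7) for i in range(2048)]
loss = stable_softmax_cross_entropy(row, label=6)
print(loss)
```

A fixed kernel could be checked against a reference like this on rows longer than 1000 elements, since that is the case reported as failing.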