Forked from PaddlePaddle / PaddleDetection
* refine seq_softmax_op
* fix seq_softmax
* use cub in seq_softmax