Created by: zhaoyuchen2018
- Optimize argsort: sorting 200,000 elements on a V100 is ~190x faster (0.53 s before the optimization, 0.0027 s after).
- Add fp16 support.
- Fix memory core dump issue: https://github.com/PaddlePaddle/Paddle/issues/20021
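
For readers unfamiliar with the op, here is a minimal pure-Python sketch of what argsort computes and how a timing of this kind could be taken. It is a CPU reference only, not the CUDA kernel changed in this PR, and the names `argsort` and `time_argsort` are hypothetical helpers introduced for illustration. Its timings are not comparable to the V100 numbers above.

```python
import random
import time


def argsort(values, descending=False):
    """Return the indices that would sort `values` (reference semantics only)."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=descending)


def time_argsort(n=200_000, seed=0):
    """Time one CPU argsort over n random floats.

    Returns (data, indices, elapsed_seconds). This measures the pure-Python
    reference above, not the GPU kernel.
    """
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    start = time.perf_counter()
    indices = argsort(data)
    elapsed = time.perf_counter() - start
    return data, indices, elapsed
```

A result can be checked by confirming that `data[indices[i]]` is non-decreasing in `i` and that `indices` is a permutation of `range(n)`.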
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>