Created by: zhaoyuchen2018
- Improve argsort performance.
- For 200000 elements on a V100, argsort runs about 190x faster: 0.53s before the optimization, 0.0027s after.
- Add fp16 support
- Refine error message
- Refine code
- Add descending sort
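
For reviewers, the intended semantics of the descending option can be sketched in plain Python. This is only an illustrative reference, not the CUDA kernel in this change; the helper name `argsort_reference` and its `descending` parameter are assumptions made for the example.

```python
def argsort_reference(values, descending=False):
    """Return the indices that would sort `values`.

    Hypothetical reference helper for illustration only; it is not
    the operator implemented in this change.
    """
    # Sort the index range by the value each index points to.
    # reverse=True yields largest-first order.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


data = [3.0, 1.0, 2.0]
ascending = argsort_reference(data)                     # indices of smallest to largest
descending = argsort_reference(data, descending=True)   # indices of largest to smallest
```

A reference like this is useful mainly as an oracle in unit tests, where the GPU output can be compared against it for small inputs.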
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>