[cherry-pick] Improve argsort performance. (#21267) (#21442)
* Improve argsort performance.
- Running argsort on 200000 elements on a V100 is ~190x faster:
  before optimization: 0.53s
  after optimization: 0.0027s
- Add fp16 support
* Refine error message
* Refine code
* Add descending sort
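
For reference, the semantics that the ascending and descending modes are expected to follow can be sketched on the CPU as below. This is only an illustration in plain Python, not the GPU kernel from this change; the name `argsort_reference` and its `descending` parameter are hypothetical, and the tie-breaking order of the real op is not stated here, so the stable ordering shown is an assumption.

```python
def argsort_reference(values, descending=False):
    """Return the indices that would sort `values` (hypothetical CPU reference)."""
    # sorted() is stable, also with reverse=True, so equal values keep
    # their original relative order. The GPU op may break ties differently.
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


values = [0.3, 1.5, -2.0, 0.9]
ascending = argsort_reference(values)
descending = argsort_reference(values, descending=True)

# Gathering the input by the returned indices yields the sorted values.
sorted_ascending = [values[i] for i in ascending]
sorted_descending = [values[i] for i in descending]
```

A reference like this can be handy for checking a faster implementation against small inputs, comparing both the indices and the gathered values.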
test=develop
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>