Created by: zhaoyuchen2018
- Improve topk performance.
Computing topk over 200000 elements took 1s before the optimization and 0.0028s after.
- Refine return value.
- Add CUDA utility functions.
- Fix ComputeBlockSize bug & refine comments.
Signed-off-by: zhaoyuchen <zhaoyuchen01@baidu.com>