* fix cuda atomicAdd for FP16
* try to fix ci
* refine structure for cuda and rocm
* update
* update
* update
* update