Forked from PaddlePaddle / Paddle
* fix CUDA atomicAdd for FP16
* try to fix CI
* refine structure for CUDA and ROCm
* update
* update
* update
* update