* refine structure for cuda and rocm
* update
* add alltoall api, test=develop