* refine structure for CUDA and ROCm
* update
* update
* update
* update
* fix c_split bug
* fix unit test
* add c_embedding for tensor parallel