Retry CUDA Initialization to Fix Random Failure, test=develop (#28323)
This PR is a follow-up to #28213. In that PR we tried to decrease GPU usage, but the CI still failed randomly. So I added retry logic for the initialization of NCCL and cuSOLVER. If the initialization fails, it is retried, which avoids the random failure.
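The retry idea described above can be sketched roughly as follows. This is only an illustration in Python, not the code from this PR (the real change lives in the framework's C++ initialization path). The names `retry_init` and `make_flaky_init`, the attempt count, and the delay are all hypothetical; the flaky function merely stands in for an NCCL or cuSOLVER handle-creation call that sometimes fails transiently.

```python
import time


def retry_init(init_fn, max_attempts=3, delay_seconds=0.0):
    """Call init_fn until it succeeds or max_attempts is exhausted.

    Hypothetical helper: the attempt count and delay are assumptions,
    not values taken from the PR.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return init_fn()
        except RuntimeError as err:
            # Remember the failure and retry unless this was the last attempt.
            last_error = err
            if attempt < max_attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"initialization failed after {max_attempts} attempts"
    ) from last_error


def make_flaky_init(failures_before_success):
    """Build a stand-in initializer that fails a fixed number of times."""
    state = {"calls": 0}

    def init():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("transient initialization failure")
        return "handle"

    init.state = state  # expose the call counter for inspection
    return init


if __name__ == "__main__":
    flaky = make_flaky_init(failures_before_success=2)
    handle = retry_init(flaky, max_attempts=3)
    print(handle, flaky.state["calls"])
```

A bounded number of attempts keeps a genuinely broken environment from hanging the CI: once the limit is reached the last error is re-raised, so persistent failures still surface.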