Created by: gongweibao
- Change backward_guard to optimize_guard to maximize the allreduce overlap. If the cast operator is placed in the backward stage, it waits for allreduce to complete, and since Paddle has only one compute stream, that stream stalls while waiting. Placing the cast operator in the optimize stage avoids this.
- Add more logs to help locate potential problems.
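
To illustrate the scheduling argument in the first item, here is a toy timeline model in plain Python. It is not Paddle code. The `simulate` function, the op durations, and the assumption that each cast consumes the allreduced gradient of one layer are all hypothetical, chosen only to show why deferring the cast ops lets allreduce overlap with the remaining backward computation.

```python
def simulate(schedule, bwd_time, cast_time, comm_time):
    """Return the finish time of a step under a toy two-stream model.

    schedule: list of ("bwd", i) or ("cast", i) in compute-stream order.
    Assumptions (illustrative only): one serial compute stream, one serial
    communication stream, and cast i must wait for allreduce i to finish.
    """
    compute_free = 0.0     # time at which the compute stream is next free
    comm_free = 0.0        # time at which the communication stream is next free
    allreduce_done = {}    # layer index -> time its allreduce completes

    for kind, i in schedule:
        if kind == "bwd":
            compute_free += bwd_time
            # The gradient is ready, so its allreduce is queued on the comm stream.
            start = max(comm_free, compute_free)
            comm_free = start + comm_time
            allreduce_done[i] = comm_free
        elif kind == "cast":
            # The single compute stream stalls until the allreduce has finished.
            compute_free = max(compute_free, allreduce_done[i]) + cast_time
        else:
            raise ValueError(f"unknown op kind: {kind}")

    return max(compute_free, comm_free)


layers = range(3)

# Cast ops placed in the backward stage, right after each gradient is produced.
interleaved = [op for i in layers for op in (("bwd", i), ("cast", i))]
# Cast ops moved to the optimize stage, after all backward ops.
deferred = [("bwd", i) for i in layers] + [("cast", i) for i in layers]

interleaved_total = simulate(interleaved, bwd_time=2, cast_time=1, comm_time=3)
deferred_total = simulate(deferred, bwd_time=2, cast_time=1, comm_time=3)

print("cast in backward stage:", interleaved_total)
print("cast in optimize stage:", deferred_total)
```

With these made-up durations, the interleaved schedule stalls the compute stream on every allreduce, while the deferred schedule keeps backward ops running back to back and only waits once communication is the remaining bottleneck.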