Created by: danleifeng
In fp16_util.py, the update_role_var_grad function changes the role of cast ops, which takes effect in ParallelExecutor. In Executor, however, it may cause errors because there is no NCCL synchronization.
To solve this problem, we move the Optimize-role ops behind all of the Backward-role ops. This can also speed up Executor training.
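For illustration only, a minimal sketch of the reordering idea; the names `OpRole`, `reorder_ops`, and `get_op_role` below are hypothetical helpers, not the actual Paddle API (in fluid the role is stored in the op attribute named by `core.op_proto_and_checker_maker.kOpRoleAttrName()`):

```python
from enum import Enum


class OpRole(Enum):
    # Hypothetical stand-in for Paddle's op-role markers.
    FORWARD = 0
    BACKWARD = 1
    OPTIMIZE = 2


def reorder_ops(ops, get_op_role):
    """Stable partition of a block's op list.

    Ops keep their original relative order, except that every op marked
    with the Optimize role is moved after all other (forward/backward) ops,
    so no optimize op runs before the backward pass has finished.
    """
    non_optimize = [op for op in ops if get_op_role(op) != OpRole.OPTIMIZE]
    optimize = [op for op in ops if get_op_role(op) == OpRole.OPTIMIZE]
    return non_optimize + optimize
```

Since the partition is stable, dependencies among the backward ops and among the optimize ops are preserved; only the interleaving between the two groups is removed.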