Created by: tonyyang-svail
Part of the optimization of parallel_do
This PR contains the following:
- add nccl library to the framework
- add nccl callback on backward
- add nccl flag in parallel_do
- use assign op to overwrite the reduced gradient
- verify the correctness of parallel_do with nccl
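
For reviewers unfamiliar with the flow, here is a rough pure-Python sketch of the idea behind the backward callback and the assign step. It is not the code in this PR: `all_reduce_sum`, `backward_with_allreduce`, the dict-based "scopes", and the name `w_grad` are all hypothetical stand-ins. It also assumes the reduction is a sum and that the assign step replaces each device's local gradient with the reduced value.

```python
from typing import Dict, List


def all_reduce_sum(per_device_grads: List[List[float]]) -> List[float]:
    """Element-wise sum across devices (stand-in for an NCCL all-reduce)."""
    return [sum(values) for values in zip(*per_device_grads)]


def backward_with_allreduce(scopes: List[Dict[str, List[float]]],
                            grad_name: str) -> None:
    """Reduce one gradient across all device scopes, then overwrite each copy."""
    reduced = all_reduce_sum([scope[grad_name] for scope in scopes])
    for scope in scopes:
        # Assign-style overwrite: the local gradient is replaced by the
        # reduced one (assumed behavior, see the note above).
        scope[grad_name] = list(reduced)


# Two hypothetical device scopes, each holding its own local gradient.
scopes = [{"w_grad": [1.0, 2.0]}, {"w_grad": [3.0, 4.0]}]
backward_with_allreduce(scopes, "w_grad")
```

After the call, every scope holds the same summed gradient, which is the property a correctness check of parallel_do with nccl would compare against a single-device run.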