Created by: qingqing01
- Only GPU is supported for now; CPU support will be added in the next PR.
- Use ncclAllReduce to sync E(x) and E(x^2) across multiple GPUs on one machine.
- Add ncclComm_t to DeviceContext; it is initialized by init.cc.
- The unit test compares the forward and backward outputs with those of batch_norm on one GPU.
build_strategy = fluid.BuildStrategy()
build_strategy.sync_batch_norm = True
binary = fluid.compiler.CompiledProgram(tp).with_data_parallel(
    loss_name=loss_mean.name,
    build_strategy=build_strategy)
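
To illustrate why syncing E(x) and E(x^2) is enough to recover the statistics of the full batch, here is a minimal pure-Python sketch for a single channel. It is not the CUDA kernel from this PR: `local_stats`, `all_reduce_sum`, `sync_batch_norm`, and `reference_batch_norm` are hypothetical names, the all-reduce is simulated by summing over a list of per-device shards, and reducing raw sums plus element counts (rather than means) is an assumption made so that shards of different sizes are handled correctly.

```python
import math


def local_stats(values):
    """Per-device partial results: (sum of x, sum of x^2, element count)."""
    return sum(values), sum(v * v for v in values), len(values)


def all_reduce_sum(per_device_stats):
    """Simulated all-reduce with a sum op (assumption: stands in for ncclAllReduce).

    Every device would receive the same element-wise totals.
    """
    return tuple(sum(parts) for parts in zip(*per_device_stats))


def sync_batch_norm(shards, eps=1e-5):
    """Normalize each shard with statistics computed over all shards."""
    total, total_sq, count = all_reduce_sum([local_stats(s) for s in shards])
    mean = total / count                   # E(x) over the global batch
    var = total_sq / count - mean * mean   # E(x^2) - E(x)^2
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * inv_std for v in shard] for shard in shards]


def reference_batch_norm(values, eps=1e-5):
    """Plain batch norm over the whole batch on one device, for comparison."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in values]


if __name__ == "__main__":
    # Two "devices" holding different parts of one batch.
    shards = [[1.0, 2.0], [3.0, 4.0, 5.0, 6.0]]
    synced = [v for shard in sync_batch_norm(shards) for v in shard]
    single = reference_batch_norm([v for shard in shards for v in shard])
    print(max(abs(a - b) for a, b in zip(synced, single)))
```

This mirrors the check described for the unit test: the synchronized result across devices should match ordinary batch norm applied to the same data on a single device, up to floating-point error.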