Implement dygraph.parallel.DataParallel to hook the reduce op.
Add NCCLParallelContext for parallel dygraph.