Created by: hutuxian
Previously, the DataNorm OP could only run on CPU. This PR implements its GPU kernel. In addition, the GPU kernel updates the three summary values in its backward (grad) pass.
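
For reviewers unfamiliar with the op, here is a minimal pure-Python sketch of the per-feature normalization a DataNorm-style forward pass computes. It assumes the three summaries are a per-feature count, sum, and square sum, and that the output is `(x - mean) * scale`. The function and argument names are hypothetical and not taken from this PR; the grad-side summary update is not sketched.

```python
import math


def data_norm_forward(x, batch_size, batch_sum, batch_square_sum):
    """Reference forward pass (assumed formula, hypothetical names).

    x is a list of rows; each summary is a per-feature list of floats.
    """
    # Assumed: mean is the accumulated sum divided by the accumulated count.
    means = [s / n for s, n in zip(batch_sum, batch_size)]
    # Assumed: scale is sqrt(count / square_sum), i.e. an inverse std estimate.
    scales = [math.sqrt(n / sq) for n, sq in zip(batch_size, batch_square_sum)]
    # Normalize every element with the statistics of its feature column.
    y = [[(v - m) * s for v, m, s in zip(row, means, scales)] for row in x]
    return y, means, scales
```

On GPU, each output element depends only on its own input and its column's statistics, so a kernel can typically assign one thread per element.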