* Fixed a bug in allclose_op that prevented it from handling some fp64 inputs.
* Improved CUDA kernel performance.
* Fixed a bug in the CUDA kernel that prevented it from handling inputs with large dimensions, and added a unit test for it.
* Added a test case for float32 input.
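
For reviewers, here is a minimal sketch of why fp64 and float32 inputs each need their own test case. It is not the operator's code, and it does not claim to reproduce the actual bug, which is not described above. It assumes the usual allclose criterion `|a - b| <= atol + rtol * |b|`; the helper names `to_float32` and `allclose` are hypothetical and exist only for this illustration.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def allclose(a, b, rtol=1e-5, atol=1e-8):
    """Reference check (assumed criterion): |a - b| <= atol + rtol * |b| for every pair."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Two fp64 values that differ by about 1e-10, compared with a tight tolerance.
a = [1.0]
b = [1.0 + 1e-10]

# Evaluated in fp64, the difference exceeds the tolerance.
fp64_result = allclose(a, b, rtol=1e-12, atol=0.0)

# If the inputs are first rounded to fp32, both become exactly 1.0
# and the difference disappears.
fp32_result = allclose(
    [to_float32(x) for x in a],
    [to_float32(y) for y in b],
    rtol=1e-12,
    atol=0.0,
)

print(fp64_result, fp32_result)
```

The two results disagree, so a comparison that silently narrows fp64 inputs can report values as close when they are not. A test that only covers one dtype would not catch this kind of discrepancy.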