Created by: wangchaochaohu
Optimize the mean op.
Environment: V100, CUDA 10, cuDNN 7
Test code:
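import numpy as np
import paddle.fluid as fluid

# Declare a float32 input and take the mean over all of its elements.
input = fluid.layers.data(
    name='data', shape=[100, 100, 10000], dtype='float32')
result = fluid.layers.mean(input)

# Feed an all-ones tensor, so the expected mean is 1.0.
x_ndarray = np.ones([100, 100, 10000]).astype(np.float32)
exe = fluid.Executor(fluid.CUDAPlace(0))
exe.run(fluid.default_startup_program())
all_result, = exe.run(feed={'data': x_ndarray}, fetch_list=[result])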
Timing covers only the compute function (measured with cudaEvent):
develop: 2921.58 ms
this PR: 0.54432 ms
Reference on the reduction technique: https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
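
For readers unfamiliar with the reduction pattern the linked post covers, here is a rough CPU-only sketch of a pairwise (tree) reduction in plain Python. It is not the kernel in this PR. The function name `tree_mean` is hypothetical and exists only for illustration. Each pass folds the upper half of the buffer onto the lower half, which is the same halving-offset idea a GPU reduction applies across threads.

```python
def tree_mean(values):
    """Mean via pairwise (tree) reduction; a CPU illustration only (hypothetical helper)."""
    buf = [float(v) for v in values]
    n = len(buf)
    if n == 0:
        raise ValueError("tree_mean requires at least one element")

    # 'active' is the number of partial sums still live at the front of buf.
    active = n
    while active > 1:
        # Round up so an odd middle element is carried to the next pass unchanged.
        half = (active + 1) // 2
        # Fold the upper part onto the lower part; on a GPU these additions
        # would run in parallel, one per thread.
        for i in range(active - half):
            buf[i] += buf[i + half]
        active = half

    # buf[0] now holds the total sum.
    return buf[0] / n
```

With an all-ones input like the one in the test code above, the result should be exactly 1.0, since every partial sum is a small integer that float arithmetic represents exactly.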