Forked from PaddlePaddle / PaddleDetection
* add cuda_device_functions.h
* move reduceSum to elementwise_op_function.h