Forked from PaddlePaddle / PaddleDetection
* Speed up elemwise grad
* Fix bug
* Add macro for MAX_BLOCK_DIM