-
由 zhaoyuchen2018 提交于
* Improve elementwise performance. Elementwise performace is poor as walk into CommonGradBroadcastCUDA, add some new kernels for different data pattern. * Add some cuda kernel to speedup common broadcast cases. test=develop * Add more test cases and fix cuda kernel bug. test=develop * Remove tests as cpu percision fails.test=develop * Refine SplitDims, test=develop * Change file mode, test=develop
58615a62