Committed by ZZK
* dev torch style permute kernel
* Refine
* fix batch permute launch condition
* fix batch permute dispatch logic
* remove redundant header file
* simplify check logic
* use permute primitives in transpose kernels
* fix batch permute logic and avoid mod
* remove redundant templates
* fix grid step
* add a grid for loop to handle the case where the element count is too large
* fix bug when hw is not divisible by tile size
* refine format
* add a copy kernel as a baseline
* remove annotation
* add copy kernel
* add sync
* use batch permute for profile
* add copy tile baseline
* simplify params for copy kernel
* add slow copy kernel
* use mul instead of mod and remove copy
* use movement size = 4 when h and w are divisible by 2
* Add temp process for half2
* add half2 specialized kernel
* remove redundant license
* simplify code
* fix format
* fix comment
* fix comment
* use bad for loop condition
* merge half2 in load
* fix bad for loop in batch permute
* refine
* use aligned storage
* refine
* fix comment
* fix comment
* fix format
* add const and remove redundant header file
* remove register macro
* refine cuda code
* fix guoran comment
* fix format
* fix some details
* remove cuda graph
* fix for 0d tensor

Co-authored-by: Noneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
bca2e098