Dev Batch Permute (#6441)
* dev torch style permute kernel
* Refine
* fix batch permute launch condition
* fix batch permute dispatch logic
* remove redundant header file
* simplified check logic
* use permute primitives in transpose kernels
* fix batch permute logic and avoid mod
* remove redundant templates
* fix grid step
* add grid for loop to handle cases where the element count is too large
* fix bug when hw is not divisible by tile size
* refine format
* add a copy kernel as a baseline
* remove annotation
* add copy kernel
* add sync
* use batch permute for profiling
* add copy tile baseline
* simplify params for copy kernel
* add slow copy kernel
* use mul instead of mod and remove copy
* use movement size = 4 when h and w are divisible by 2
* add temporary handling for half2
* add half2 specialized kernel
* remove redundant license
* simplified code
* fix format
* fix comment
* fix comment
* use bad for loop condition
* merge half2 in load
* fix bad for loop in batch permute
* refine
* use aligned storage
* refine
* fix comment
* fix comment
* fix format
* add const and remove redundant header file
* remove register macro
* refine cuda code
* address guoran's comment
* fix format
* fix some details
* remove cuda graph
* fix for 0d tensor
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>