    Dev Batch Permute (#6441) · bca2e098
    Committed by ZZK
    * dev torch style permute kernel
    
    * Refine
    
    * fix batch permute launch condition
    
    * fix batch permute dispatch logic
    
    * remove redundant header file
    
    * simplified check logic
    
    * use permute primitives in transpose kernels
    
    * fix batch permute logic and avoid mod
    
    * remove redundant templates
    
    * fix grid step
    
    * add grid-stride for loop to handle an element count that is too large
    
    * fix bug when hw is not divisible by tile size
    
    * refine format
    
    * add a copy kernel as a baseline
    
    * remove annotation
    
    * add copy kernel
    
    * add sync
    
    * use batch permute for profile
    
    * add copy tile baseline
    
    * simplify params for copy kernel
    
    * add slow copy kernel
    
    * use mul instead of mod and remove copy
    
    * use movement size = 4 when h and w are divisible by 2
    
    * Add temp process for half2
    
    * add half2 specialized kernel
    
    * remove redundant license
    
    * simplified code
    
    * fix format
    
    * fix comment
    
    * fix comment
    
    * use bad for loop condition
    
    * merge half2 in load
    
    * fix bad for loop in batch permute
    
    * refine
    
    * use align storage
    
    * refine
    
    * fix comment
    
    * fix comment
    
    * fix format
    
    * add const and remove redundant header file
    
    * remove register macro
    
    * refine cuda code
    
    * address guoran's comment
    
    * fix format
    
    * fix some details
    
    * remove cuda graph
    
    * fix for 0d tensor
    Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
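
    Several entries above (tiling, the fix for hw not divisible by the tile size, and replacing mod with mul) concern the index arithmetic of a batched transpose. The following is a minimal CPU reference sketch in C++, not the CUDA kernel from this commit. It assumes a row-major [batch, rows, cols] layout; the tile size of 32 and the names kTileSize and BatchTransposeReference are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical tile edge; the tile size used by the real kernel is not stated in the commit message.
constexpr int64_t kTileSize = 32;

// Swaps the last two dimensions of a row-major [batch, rows, cols] tensor,
// walking the data tile by tile the way a tiled GPU kernel would.
template <typename T>
std::vector<T> BatchTransposeReference(const std::vector<T>& src, int64_t batch,
                                       int64_t rows, int64_t cols) {
  std::vector<T> dst(src.size());
  // Ceiling division so that partial tiles at the right and bottom edges are still visited.
  const int64_t tiles_per_row = (cols + kTileSize - 1) / kTileSize;
  const int64_t tiles_per_col = (rows + kTileSize - 1) / kTileSize;
  const int64_t tiles_per_batch = tiles_per_row * tiles_per_col;
  const int64_t total_tiles = batch * tiles_per_batch;

  for (int64_t tile = 0; tile < total_tiles; ++tile) {
    // Decompose the linear tile index with a division followed by a
    // multiply-subtract, which avoids the '%' operator.
    const int64_t b = tile / tiles_per_batch;
    const int64_t in_batch = tile - b * tiles_per_batch;
    const int64_t tile_r = in_batch / tiles_per_row;
    const int64_t tile_c = in_batch - tile_r * tiles_per_row;

    // Clamp the tile to the tensor bounds so edge tiles do not read or write out of range.
    const int64_t r_begin = tile_r * kTileSize;
    const int64_t c_begin = tile_c * kTileSize;
    const int64_t r_end = std::min(r_begin + kTileSize, rows);
    const int64_t c_end = std::min(c_begin + kTileSize, cols);

    for (int64_t r = r_begin; r < r_end; ++r) {
      for (int64_t c = c_begin; c < c_end; ++c) {
        // src is [batch, rows, cols]; dst is [batch, cols, rows].
        dst[(b * cols + c) * rows + r] = src[(b * rows + r) * cols + c];
      }
    }
  }
  return dst;
}
```

    In a GPU version the outer tile loop would typically become a grid-stride loop over blocks, so a fixed grid size can still cover a very large element count.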
transpose_kernel.cpp 2.7 KB