    H2D data transfer optimization for split kernel (#49086) · 057ba778
    limingshu committed
    * profile reduce kernel for fp16 and reduceHigherdim
    
    * use reinterpret_cast
    
    * fix for CI on ROCm
    
    * add Macro for ROCm
    
    * ROCm CI config
    
    * ROCm CI config
    
    * unit test repair
    
    * pull
    
    * add common_funcs.h
    
    * reduceType
    
    * Update reduce_function.h
    
    * not higher
    
    * rename
    
    * implement matmul using cublasLt instead of cublas
    
    * cublasLt bugfix
    
    * Update matmul_kernel_impl.h
    
    * Update matmul_kernel_impl_via_blasLt.h
    
    * for-loop-algo
    
    * PR comments changes
    
    * add macro
    
    * ci unused variable isCublasLt
    
    * ci unused variable isCublasLt macro
    
    * split matmul to autotune
    
    * rewrite the split kernel with segmented_array
    
    * rewrite the split kernel with segmented_array
    
    * rewrite the split kernel with segmented_array
    
    * add some method for cuda_graph
    
    * fix bugs for rocm
    
    * change for ci-error
    
    * ci-model-benchmark reports an unexplained error, so restore the original code to check whether it still passes
    
    * add changes to pass the model_benchmark and coverage CI
    
    * fix ci error
    
    * fix ci-rocm error
    
    * add some changes for header
    
    ---------
    Co-authored-by: zhangbopd <1299246947@qq.com>
    Co-authored-by: Bo Zhang <105368690+zhangbopd@users.noreply.github.com>
concat_and_split_functor.cu 35.1 KB