    [cherry-pick 2.5] Broadcast && Dropout_nd Performance Optimization into Release/2.5 (#53623) · f9ea2301
    Bo Zhang committed
    * Support different dtypes of inputs for broadcast for dropout optimization  (#52093)
    
    * change judgement for DropoutGradGPUKernelDriver
    
    * add UnrollerWithoutVecSize and after this Loaddata to be refined
    
    * pass unittest
    
    * use same unroller with XPU
    
    * BroadcastWithInt64Index
    
    * BroadcastDataLoader template partial specialization
    
    * fix compile errors on ROCm
    
    * PR comment
    
    * dropout_nd_optimization (#51479)
    
    * with printf
    
    * add DropOutNdForwardKernel
    
    * PR comment
    
    * Dropout optimize & clean broadcast inT and ElementwiseType (#52969)
    
    * change judgement for DropoutGradGPUKernelDriver
    
    * add UnrollerWithoutVecSize and after this Loaddata to be refined
    
    * pass unittest
    
    * use same unroller with XPU
    
    * BroadcastWithInt64Index
    
    * BroadcastDataLoader template partial specialization
    
    * fix compile errors on ROCm
    
    * clean ElementwiseT and InT for BroadcastKernel
    
    * default axis and clean inT
    
    * remove redundant fast divmod computation
    
    * optimize drop_nd & drop_nd_grad
    
    * optimize BroadcastDataLoader bf16 fp16
    
    * rm InT etc. after merge develop
    
    * delete constexpr for windows ci
    
    * fix conflict
    
    * fix conflict with develop
    
    * fix conflict
    
    * new clean
    
    * clean
    
    * Fix xpu2 kp compile error (#53548)
    
    * fix conflict
    
    * conflict
fmha_ref.h 28.1 KB