    cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU (#39437) · 2a3d9eca
    Ming-Xu Huang committed
    * Added cuBlasLtHandle_t to device context.
    
    * Added fused_gemm_epilogue op.
    
    1. Added fused_gemm_epilogue op to leverage the cuBlasLt epilogue.
    2. Supports fusing Act(X*Y + bias); X must have at least 2 dims and Y
    must have exactly 2 dims.
    3. Act currently supports only ReLU (GeLU will be added in the future).
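As a rough reference for what the fused op computes, the sketch below evaluates ReLU(X*Y + bias) in plain Python. It is not the cuBlasLt kernel. The function name `fused_gemm_epilogue_ref` is hypothetical, and it assumes X has already been flattened to 2-D (M x K), which is one common way to handle an X with more than 2 dims.

```python
def fused_gemm_epilogue_ref(x, y, bias):
    """Reference for ReLU(x @ y + bias) on nested lists.

    x: M x K, y: K x N, bias: length N. Hypothetical helper, for
    illustration only; the real op runs this as one cuBlasLt call.
    """
    k_dim, n_dim = len(y), len(bias)
    out = []
    for row in x:
        out_row = []
        for j in range(n_dim):
            # GEMM part: dot product of one row of x with column j of y.
            acc = sum(row[k] * y[k][j] for k in range(k_dim))
            # Epilogue part: bias add followed by ReLU.
            out_row.append(max(acc + bias[j], 0.0))
        out.append(out_row)
    return out
```

The point of the fusion is that the bias add and activation happen inside the GEMM call rather than as separate elementwise kernels.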
    
    * Added UT to fused_gemm_epilogue op.
    
    * Added LinearAct Pattern
    
    1. Added LinearAct into graph_pattern_detector.* to define (2.)'s
    pattern.
    2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)).
    3. act currently supports only ReLU (GeLU will be supported in the future).
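The LinearAct pattern can be pictured with a toy matcher over a flat op list. This is only a sketch: the dict-based op format and the name `find_linear_act` are assumptions made for illustration and are not the graph_pattern_detector API. It also assumes each intermediate result must have exactly one consumer before it can be fused away.

```python
def find_linear_act(ops, act_types=("relu",)):
    """Find (matmul_v2, elementwise_add, act) chains in a toy op list.

    Each op is a dict with "type", "inputs" (list of variable names) and
    "output" (one variable name). Hypothetical format, illustration only.
    """
    # Map every variable name to the ops that read it.
    consumers = {}
    for op in ops:
        for name in op["inputs"]:
            consumers.setdefault(name, []).append(op)

    def sole_consumer(var, allowed_types):
        users = consumers.get(var, [])
        if len(users) == 1 and users[0]["type"] in allowed_types:
            return users[0]
        return None

    matches = []
    for op in ops:
        if op["type"] != "matmul_v2":
            continue
        add = sole_consumer(op["output"], ("elementwise_add",))
        if add is None:
            continue
        act = sole_consumer(add["output"], act_types)
        if act is not None:
            matches.append((op, add, act))
    return matches
```

Passing `act_types=("relu", "gelu")` widens the match once GeLU is supported.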
    
    * Added FuseGemmEpiloguePass
    
    1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU}
    fusion (GeLU will be supported in the future).
    2. Only support matmul_v2 from nn.Linear.
    
    * Added pybind to BuildStrategy.fuse_gemm_epilogue_.
    
    * Added UT for fuse_gemm_epilogue_pass.
    
    * GeLU support and EpilogueSingleton
    
    1. Added GeLU support to fused_gemm_epilogue op.
    2. Added EpilogueSingleton to cache auxiliary pointer.
    3. Added related UTs.
    
    * Rename cublaslt_epilogue_op to gemm_epilogue_op.*.
    
    * Added both train and infer pattern to LinearAct.
    
    1. Added support for fwd graphs with grad_ops linking to LinearAct.
    2. Added related changes to fuse_gemm_epilogue_pass for above
    modification.
    
    * Changed CUDA requirement from 11.4 to 11.6 for fuse_gemm_epilogue_pass.
    
    * Added identity activation support to gemm_epilogue_op.
    
    * Added Linear Fusion (matmul_v2 + ele_add)
    
    1. Added matmul_v2 + ele_add pattern to LinearActPattern.
    2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.
    
    * Rename gemm_epilogue_op.* to fused_gemm_epilogue_op.*
    
    * Add fused_gemm_epilogue_grad op.
    
    1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.
    
    * Add UTs to fused_gemm_epilogue_grad_op.
    
    * Change attribute name in fused_gemm_epilogue_grad_op for clarity.
    
    * Allow DX and DBias to be dispensable in fused_gemm_epilogue_grad op.
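For orientation, the gradients of Out = X*Y + bias can be written out in plain Python. This is a sketch and not the op's implementation: `linear_grad_ref`, `need_dx` and `need_dbias` are hypothetical names, the activation gradient is left out, and X is assumed to be 2-D. It shows why DX and DBias can be skipped independently of DY.

```python
def linear_grad_ref(dout, x, y, need_dx=True, need_dbias=True):
    """Gradients of out = x @ y + bias, given dout (M x N).

    x: M x K, y: K x N. Returns (dx, dy, dbias); dx and dbias are None
    when not requested. Hypothetical helper, for illustration only.
    """
    m_dim, k_dim, n_dim = len(x), len(y), len(y[0])

    # dY = X^T @ dOut, shape K x N.
    dy = [[sum(x[i][a] * dout[i][j] for i in range(m_dim))
           for j in range(n_dim)] for a in range(k_dim)]

    dx = None
    if need_dx:
        # dX = dOut @ Y^T, shape M x K.
        dx = [[sum(dout[i][j] * y[a][j] for j in range(n_dim))
               for a in range(k_dim)] for i in range(m_dim)]

    dbias = None
    if need_dbias:
        # dBias = column sums of dOut, length N.
        dbias = [sum(dout[i][j] for i in range(m_dim)) for j in range(n_dim)]

    return dx, dy, dbias
```

Skipping DX is useful when X does not need a gradient, for example when it is the network input.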
    
    * Added ElementwiseAdd+Matmul+Act graph pattern detection.
    
    * Fuse backward of Linear(Act(x))
    
    1. Added backward fusion pass to Linear(Act(x)).
    2. Added backward fusion pass to Linear(x).
    
    * Added UTs to backward fusion of Linear(Act(x)).
    
    * Complete document of arguments to fused_gemm_epilogue_op.
    
    * Made arguments of some functions pass by reference.
    
    * Modify code with review comments.
    
    1. Made arguments of some function pass by reference.
    2. Removed redundant code.
    3. Followed Google code style to change code.
    
    * Made 'const' code style consistent
    
    * Fixed random seed of python UTs.
    
    * Set compile constraints for cuBlasLt
    
    1. Require CUDA 11.6+
    2. Remove fuse_gemm_epilogue related tests when CUDA < 11.6.
    
    * Code Review from Paddle
    
    1. Renamed argument is_first_gemm to without_x_gradient for
    clarity.
    2. Applied PADDLE_THROW in fused_gemm_epilogue_op.
    
    * Remove EpilogueSingleton
    
    1. Used ReserveSpace instead of EpilogueSingleton to pass auxiliary
    pointers between FWD and BWD.
    
    * Fix a logical error and enhance UTs.
    
    1. Added act op count checking in UTs.
    2. Fixed an issue when fusing the backward of ReLU(Linear(X)).
    3. TODO: solve GELU fusion issues.
    
    * Fix Linear and GeLU fusion issues.
    
    1. Modified graph_pattern_detector to fit Linear with either GeLU or
    ReLU.
    2. Modified data range in UTs to allow negative values.
    
    * Removed fused_gemm_epilogue_op.h.
    
    * Rename namespace pten to phi.
    
    * Rename name of arguments in fused_gemm_epilogue_op
    
    1. bias -> Bias.
    2. out -> Out.
    3. reserve_space -> ReserveSpace.
    
    * Change EpiloguePassActivationCache to a local variable.
    
    1. Removed singleton in EpiloguePassActivationCache.
    2. Passed EpiloguePassActivationCache as an argument to each pass
    function.
device_context.h 28.3 KB