Fix wrong reduce_dims in fused_gate_attention and optimize memory usage. (#43216)
* Polish code and reduce memory usage of fused_gate_attention.
* Fix wrong reduce_dims in fused_gate_attention when computing the gradient of nonbatched_bias (see the illustrative sketch below).
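For context, here is a minimal NumPy sketch (not the actual fused CUDA kernel) of why the reduce dims matter: nonbatched_bias is broadcast over a batch-like dimension when it is added to the attention logits, so its gradient must be summed over exactly the broadcast dimensions. The shapes and variable names below are illustrative assumptions, not the operator's real layout.

```python
import numpy as np

# Illustrative shapes (assumptions, not the real op's layout):
#   grad_logits:     [batch, num_heads, q_len, k_len]
#   nonbatched_bias: [1,     num_heads, q_len, k_len]  (broadcast over batch)
batch, num_heads, q_len, k_len = 2, 4, 8, 8
grad_logits = np.random.randn(batch, num_heads, q_len, k_len)

# Correct reduce_dims: sum over the dimension that was broadcast (dim 0 here),
# so grad_bias has the same shape as nonbatched_bias.
grad_bias = grad_logits.sum(axis=0, keepdims=True)
assert grad_bias.shape == (1, num_heads, q_len, k_len)

# Using wrong reduce_dims (e.g. summing over a non-broadcast dimension) would
# yield a gradient whose shape and values do not match nonbatched_bias.
```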