Created by: mapingshuo
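In experiments, we found that the throughput of the layer_norm_norm GPU kernel drops to 28% of its original value at large batch sizes. This PR fixes the problem. For a detailed description, see #23819 (closed).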