* blockReduce optimization
* align launched threads to warpSize
* remove unnecessary shared memory used to broadcast the reduced value
* vectorize SoftmaxKernelWithEltadd
* add fp16 constraint
* test=develop
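
As a rough illustration of the thread-alignment item, the host-side sketch below rounds a requested thread count up to a multiple of the warp size, so that every warp in the block is full and warp-level reduction steps see no partially populated warps. The names `AlignThreadsToWarp`, `NumWarps`, `kWarpSize`, and `kMaxThreadsPerBlock` are hypothetical and not taken from this change; the values 32 and 1024 are assumptions that hold on current NVIDIA GPUs.

```cpp
#include <algorithm>
#include <cassert>

// Assumed hardware limits (hypothetical constants, not from the patch).
constexpr int kWarpSize = 32;
constexpr int kMaxThreadsPerBlock = 1024;

// Round a requested thread count up to the next multiple of the warp size,
// capped at the assumed per-block thread limit.
inline int AlignThreadsToWarp(int requested) {
  assert(requested > 0);
  int aligned = (requested + kWarpSize - 1) / kWarpSize * kWarpSize;
  return std::min(aligned, kMaxThreadsPerBlock);
}

// Number of full warps in an aligned block. A block reduce that first
// reduces within each warp only needs to stage this many partial values.
inline int NumWarps(int aligned_threads) {
  assert(aligned_threads % kWarpSize == 0);
  return aligned_threads / kWarpSize;
}
```

For example, a row of 100 elements would launch 128 threads (4 warps) under these assumptions, rather than 100 threads with a partially filled last warp.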