Created by: panyx0718
layer_norm forward and backward: overall speedup of 3x ~ 4x.
Transformer step time on a single device drops from 0.157 to 0.125 (roughly 20% lower).
The pre-commit hook also automatically reformatted some code.