Implement the layer normalization operator.
Created by: lcy-seso
Layer normalization is required by the Transformer. I previously wrote a note on it: https://github.com/lcy-seso/learning_notes/blob/master/layer_normalization/layer_normalization.pdf .
Layer normalization can be viewed as transposing the input X first; the remaining computation is the same as in batch normalization. However, layer normalization does not need to maintain moving averages of the mean and variance, so training and inference behave identically. A small sketch of the forward computation is given below.
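To make the difference concrete, here is a minimal NumPy sketch of the forward pass, not the framework operator itself; the names `layer_norm`, `gamma`, `beta`, and `eps` are my own and only illustrate the idea that statistics are computed per sample over the feature axis:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over the last dimension of x.

    x:     input of shape (batch, features)
    gamma: learnable scale of shape (features,)
    beta:  learnable shift of shape (features,)
    """
    # Statistics are computed per sample (over the feature axis),
    # not per feature over the batch as in batch normalization,
    # so no moving averages are needed and train/test paths match.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```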
When implementing the layer normalization operator, my rough thought is that it is better to first consider how to refactor/reuse the code in batch normalization. However, it is also acceptable if, after thinking more about this operator, we find it is better to make the implementation independent of batch normalization.