Improve numerical stability of softmax + cross entropy loss
Created by: demon386
The combination of cross entropy loss with softmax as the last layer's activation is the standard setup for classification. To improve numerical stability, many ML libraries fuse the two into a single layer. The mathematical motivation is explained at https://www.zhihu.com/question/40403377/answer/86783636 and http://freemind.pluskid.org/machine-learning/softmax-vs-softmax-loss-numerical-stability/
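A minimal NumPy sketch of the instability, assuming a single example with integer class label (function names here are illustrative, not from any particular library). The naive two-step version overflows in `exp` for large logits, while the fused version computes log-softmax directly via the log-sum-exp trick (shifting by the max logit) and stays finite:

```python
import numpy as np

def naive_cross_entropy(logits, label):
    # Two-step version: softmax first, then log.
    # exp() overflows for large logits, and log() sees 0 or nan
    # once probabilities degenerate.
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(probs[label])

def fused_cross_entropy(logits, label):
    # Fused version: compute log-softmax directly with the
    # log-sum-exp trick. Subtracting the max logit keeps every
    # exponent <= 0, so exp() never overflows.
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

logits = np.array([1000.0, 10.0, -5.0])
print(naive_cross_entropy(logits, 0))  # nan (exp(1000) overflows to inf)
print(fused_cross_entropy(logits, 0))  # ~0.0, the correct loss
```

This is the same shift-by-max trick the linked articles describe; fusing the two ops also simplifies the backward pass, since the gradient of the combined layer reduces to `softmax(logits) - one_hot(label)`.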
For reference, see implementations in other popular ML libraries: