The simplest softmax regression model feeds the input into a fully connected layer and applies softmax directly for multi-class classification \[[9](#References)\].
The input $X$ is multiplied by the weights $W$, the bias $b$ is added, and the result is passed through the softmax activation:
$$ y_i = \text{softmax}(\sum_j W_{i,j}x_j + b_i) $$
where $ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} $.
For an $N$-class classification problem with $N$ output nodes, softmax normalizes an $N$-dimensional input vector into $N$ real values in $[0, 1]$ that sum to 1, each representing the probability that the sample belongs to the corresponding class. Here $y_i$ is the predicted probability that an image is digit $i$.
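
To make the formula concrete, here is a minimal sketch of the forward pass in plain Python. The function names `softmax` and `softmax_regression`, and the toy weights, bias, and input, are assumptions chosen for illustration and are not part of the model trained in this chapter. Subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(z):
    """Map a list of real scores to probabilities that sum to 1."""
    # Shifting by the max score leaves the output unchanged
    # but keeps exp() from overflowing on large scores.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_regression(W, b, x):
    """Compute y_i = softmax(sum_j W[i][j] * x[j] + b[i])."""
    scores = [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]
    return softmax(scores)

# Toy example with hypothetical values: 3 classes, 2 input features.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
x = [1.0, 2.0]
y = softmax_regression(W, b, x)  # scores are [1.0, 2.0, 3.0]
```

In this toy case the third class has the largest score, so it also receives the largest probability, and the three outputs sum to 1.
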
In classification problems, we usually use the cross-entropy loss function: