The last layer of a deep neural network for multi-class classification is often a combined layer of a fully-connected function and the softmax function, after which the cross-entropy function computes the final loss. Since the number of parameters in this final fully-connected layer grows linearly with the number of classes, training a neural network faces the following challenges when the task has a large number of classes:
* GPU memory exhaustion: the weight matrix of the fully-connected layer, along with its gradients and optimizer states, can exceed the memory of a single GPU when the number of classes grows large.
* Heavy communication overhead: for data-parallel training, gradients must be exchanged among GPUs after every step, and the gradient of the fully-connected layer is as large as the layer itself, so the communication volume also grows linearly with the number of classes (see the sketch after this list).
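
The snippet below is a minimal sketch of this classification head and its memory footprint; the sizes (`feature_dim = 512`, `num_classes = 1_000_000`) are illustrative assumptions, not values from this document.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions): 512-d features, 1M classes.
feature_dim = 512
num_classes = 1_000_000

# The head described above: a fully-connected layer whose logits feed
# softmax + cross-entropy (F.cross_entropy fuses log-softmax and NLL).
fc = nn.Linear(feature_dim, num_classes)

features = torch.randn(32, feature_dim)            # a batch of 32 feature vectors
labels = torch.randint(0, num_classes, (32,))      # one class index per sample
loss = F.cross_entropy(fc(features), labels)

# The weight matrix has feature_dim * num_classes entries, so its size
# grows linearly with the number of classes. In fp32:
params = feature_dim * num_classes + num_classes   # weights + bias
print(f"{params * 4 / 2**30:.2f} GiB for the fc layer alone")  # ~1.91 GiB
# Gradients and Adam optimizer states multiply this footprint by 3-4x,
# and the gradient tensor is also what data parallelism must all-reduce
# among GPUs at every step.
```

Scaling `num_classes` to 10 million pushes the layer alone past 19 GiB in fp32, which is why both single-GPU memory and per-step gradient exchange become bottlenecks at this scale.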