Some notes on experiments with MFCC
Created by: kuke
MFCCs (Mel Frequency Cepstral Coefficients) are a widely used representation of audio data in ASR (automatic speech recognition), and are thought to approximate the human auditory system's response better than the linearly-spaced frequency spectrum. Many ASR systems have achieved state-of-the-art performance by taking advantage of MFCCs. Since Deep Speech 2 takes only the power spectrum as its input feature, it is worth evaluating the performance of MFCCs on the same network.
The experimental results will be continuously updated in this issue.
The MFCC feature used here is a 39-dimension vector, consisting of the 13 basic cepstral coefficients plus their first- and second-order derivatives, with the first component replaced by the energy of the frame.
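For concreteness, below is a minimal sketch of how such a 39-dimension feature can be computed, assuming the `python_speech_features` package; the actual feature pipeline in this repo may differ, and the function name here is only illustrative.

```python
import numpy as np
from python_speech_features import mfcc, delta


def extract_mfcc_39(samples, sample_rate=16000):
    """Return a (num_frames, 39) feature matrix: 13 cepstral coefficients
    (with the first replaced by log frame energy), plus their first- and
    second-order derivatives."""
    # 13 basic coefficients; appendEnergy=True replaces the first
    # coefficient with the log energy of the frame.
    base = mfcc(samples, samplerate=sample_rate, numcep=13, appendEnergy=True)
    d1 = delta(base, 2)   # first-order derivatives
    d2 = delta(d1, 2)     # second-order derivatives
    return np.concatenate([base, d1, d2], axis=1)  # shape: (T, 39)
```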
In the first attempt, the training process entirely follows the default settings in train.py, except that the kernel and padding sizes in the conv layers are adjusted to fit the new feature dimension. But convergence is a little slow. Then, inspired by Wav2letter, the model is retrained with no striding in the feature dimension (see the sketch at the end of these notes), and relatively better convergence appears, as shown in the figure below.
The validation cost doesn't decay significantly toward the end, so training continues with a smaller learning rate after pass 25. The rest of the learning curves will be appended later.
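For reference, here is a sketch of the no-striding change mentioned above. It uses PyTorch's `nn.Conv2d` purely for illustration; the kernel sizes, channel counts, and axis ordering below are assumptions for the example, not the values actually used in train.py.

```python
import torch.nn as nn

# Strided baseline: subsamples along both the feature (frequency) axis
# and the time axis. Kernel/padding values are illustrative only.
conv_strided = nn.Conv2d(
    in_channels=1, out_channels=32,
    kernel_size=(11, 11),   # (feature, time), an assumed axis order
    stride=(2, 2),          # strides in the feature dimension too
    padding=(5, 5))

# Wav2letter-inspired variant: stride only along the time axis, so the
# 39 MFCC dimensions are not subsampled by the conv layers.
conv_no_freq_stride = nn.Conv2d(
    in_channels=1, out_channels=32,
    kernel_size=(11, 11),
    stride=(1, 2),          # stride 1 in the feature dimension
    padding=(5, 5))
```

The motivation for the change is that the MFCC feature dimension (39) is much smaller than a power-spectrum dimension, so striding over it discards proportionally more information.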