Deep Speech 2 on PaddlePaddle: Plan & Task Breakdown
Created by: xinghai-sun
We are planning to build Deep Speech 2 (DS2) [1], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:
- Release a basic distributed implementation of DS2 on PaddlePaddle.
- Contribute a Deep Speech chapter to the PaddlePaddle Book.
Intensive system optimization and a low-latency inference library (details in [1]) are not covered in this first-stage plan.
Tasks
We roughly break down the project into 14 tasks:
1. Develop an audio data provider:
    - JSON file-list generator.
    - Audio file format transformer.
    - Spectrogram feature extraction, power normalization, etc.
    - Batch data reader with SortaGrad.
    - Data augmentation (optional).
    - Prepare (one or more) public English datasets & baseline.
    - https://github.com/PaddlePaddle/Paddle/issues/2226
    - https://github.com/PaddlePaddle/Paddle/issues/2227
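The feature extraction and SortaGrad steps of the audio data provider could look roughly like the sketch below. It is a minimal illustration, not the planned PaddlePaddle implementation. The function names, the 20 ms / 10 ms framing, the manifest keys `audio` and `duration`, and the per-utterance mean/variance normalization (used here as a stand-in for power normalization) are all assumptions. After the first epoch the sketch shuffles individual utterances, which simplifies the minibatch-level shuffling described in [1].

```python
import random

import numpy as np


def log_spectrogram(samples, sample_rate, window_ms=20.0, stride_ms=10.0, eps=1e-14):
    """Log power spectrogram of a 1-D waveform, shape (num_frames, num_bins).

    Assumes the waveform holds at least one full window of samples.
    """
    window = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    num_frames = 1 + (len(samples) - window) // stride
    # Slice overlapping frames and taper each one with a Hann window.
    frames = np.stack(
        [samples[i * stride:i * stride + window] for i in range(num_frames)]
    )
    frames = frames * np.hanning(window)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)


def normalize(features, eps=1e-8):
    """Per-utterance mean/variance normalization over time (an assumed choice)."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)


def sortagrad_batches(manifest, batch_size, epoch, seed=0):
    """Yield batches of manifest entries.

    Epoch 0 goes from short to long utterances (SortaGrad);
    later epochs use a shuffled order.
    """
    if epoch == 0:
        entries = sorted(manifest, key=lambda entry: entry["duration"])
    else:
        entries = list(manifest)
        random.Random(seed + epoch).shuffle(entries)
    for start in range(0, len(entries), batch_size):
        yield entries[start:start + batch_size]
```

Sorting by duration also keeps utterances of similar length in the same batch, which reduces padding when sequences are padded to a fixed length.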
2. Create a simplified DS2 model configuration:
    - With only fixed-length (by padding) audio sequences (otherwise needs Task 3).
    - With only bidirectional GRU (otherwise needs Task 4).
    - With only a greedy decoder (otherwise needs Tasks 5 and 6).
    - https://github.com/PaddlePaddle/Paddle/issues/2231
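For reference, the greedy decoder used by the simplified configuration amounts to CTC best-path decoding: take the most likely label at each frame, merge consecutive repeats, then drop blanks. The sketch below assumes per-frame probabilities as nested lists and a caller-supplied blank index. The function and parameter names are illustrative and are not a PaddlePaddle API.

```python
def ctc_greedy_decode(probs, vocabulary, blank_index):
    """Best-path CTC decoding.

    probs: sequence of frames, each a sequence of per-label probabilities.
    vocabulary: maps each non-blank label index to its character.
    blank_index: index of the CTC blank label.
    """
    # Most likely label at every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a label only when it differs from the previous frame and is not blank.
        if index != previous and index != blank_index:
            decoded.append(vocabulary[index])
        previous = index
    return "".join(decoded)
```

A blank between two identical labels keeps them separate, so the frame-level path `a a _ a b b` decodes to `aab`.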
3. Develop support for variable-shaped dense-vector (image) batches of input data:
    - Update `DenseScanner` in `dataprovider_converter.py`, etc.
    - https://github.com/PaddlePaddle/Paddle/issues/2198
4. Develop a new lookahead-row-convolution layer (see [1] for details):
    - Lookahead convolution windows.
    - Within-row convolution, without kernels shared across rows.
    - https://github.com/PaddlePaddle/Paddle/issues/2228
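As we read [1], the lookahead row convolution gives every feature dimension (row) its own small filter that looks only at the current and a few future time steps. The NumPy sketch below shows the forward pass under assumed conventions: activations have shape `(T, d)`, the weights have shape `(context, d)` where `context` counts the current step plus the lookahead steps, and the future is zero-padded. The names and the layout are illustrative; the actual layer would be written inside PaddlePaddle.

```python
import numpy as np


def lookahead_row_conv(h, w):
    """Forward pass of a lookahead row convolution.

    h: (T, d) activations over T time steps.
    w: (context, d) weights, one filter per feature dimension.
    Returns r with r[t, i] = sum_j w[j, i] * h[t + j, i], treating
    time steps beyond the end of the sequence as zeros.
    """
    num_steps, dim = h.shape
    context = w.shape[0]
    # Zero-pad the future so that the last steps still see a full window.
    padding = np.zeros((context - 1, dim), dtype=h.dtype)
    padded = np.concatenate([h, padding], axis=0)
    out = np.zeros_like(h)
    for j in range(context):
        # w[j] broadcasts over time: each feature column uses its own weight.
        out += w[j] * padded[j:j + num_steps]
    return out
```

Because the window reaches only forward by a fixed number of steps, a unidirectional recurrent model can use a bounded amount of future context without waiting for the whole utterance.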
5. Build a KenLM n-gram language model for beam search decoding:
    - Use the KenLM toolkit: Kneser-Ney smoothed, 5-gram, with pruning, etc.
    - Prepare the corpus & train the model.
    - Create inference interfaces pluggable into CTC beam search (for Task 6).
    - https://github.com/PaddlePaddle/Paddle/issues/2229
6. Develop a beam search decoder with CTC + LM + WORDCOUNT:
    - Beam search with CTC.
    - Beam search with an external custom scorer (e.g. LM).
    - Try to design a more general beam search interface.
    - https://github.com/PaddlePaddle/Paddle/issues/2230
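The CTC + LM + WORDCOUNT objective from [1] combines the CTC log probability of a transcript with a weighted language model log probability and a weighted word count. The sketch below applies that score to rerank finished candidates, which is a simplification: the planned decoder would apply the external scorer during the beam search itself. The function names, the `(transcript, ctc_log_prob)` candidate format, and the `lm_log_prob` callable are assumptions standing in for the scorer interface.

```python
import math


def combined_score(ctc_log_prob, words, lm_log_prob, alpha, beta):
    """CTC log probability + alpha * LM log probability + beta * word count."""
    return ctc_log_prob + alpha * lm_log_prob(words) + beta * len(words)


def rescore(candidates, lm_log_prob, alpha, beta):
    """Return the transcript with the highest combined score.

    candidates: list of (transcript, ctc_log_prob) pairs.
    lm_log_prob: callable mapping a list of words to a log probability;
                 a hypothetical stand-in for a KenLM-backed scorer.
    """
    best_text, _ = max(
        candidates,
        key=lambda item: combined_score(
            item[1], item[0].split(), lm_log_prob, alpha, beta
        ),
    )
    return best_text


def toy_lm(words):
    """Toy language model that strongly prefers the phrase 'the cat'."""
    return math.log(0.9) if words == ["the", "cat"] else math.log(0.1)
```

With `alpha = 0` and `beta = 0` the ranking follows the acoustic model alone. Raising `alpha` lets the language model override acoustically plausible but unlikely word sequences, and `beta` offsets the language model's bias toward shorter transcripts.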
7. Develop a Word Error Rate evaluator:
    - Update `ctc_error_evaluator` (CER) to support WER.
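Word error rate is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. Character error rate is the same computation over characters. The sketch below is a plain-Python illustration with assumed function names. It does not show how `ctc_error_evaluator` is implemented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, so they should not be treated as percentages bounded by 100%.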
8. Prepare an internal dataset for Mandarin (optional):
    - Dataset, baseline, evaluation details.
    - Mandarin-specific data preprocessing.
    - Might need cooperation with the Department of Speech.
    - https://github.com/PaddlePaddle/Paddle/issues/2232
9. Create the standard DS2 model configuration:
    - With variable-length audio sequences (needs Task 3).
    - With unidirectional GRU + row convolution (needs Task 4).
    - With CTC-LM beam search decoder (needs Tasks 5 and 6).
10. Make it run perfectly on clusters.
11. Experiments and benchmarking (for accuracy, not efficiency):
    - With a public English dataset.
    - With an internal (Baidu) Mandarin dataset (optional).
12. Time profiling and optimization.
13. Prepare docs.
14. Prepare a PaddlePaddle Book chapter with a simplified version.
Task Dependency
Tasks parallelizable within phases:
Roadmap | Description | Parallelizable Tasks |
---|---|---|
Phase I | Basic model & components | Task 1 ~ Task 8 |
Phase II | Standard model & benchmarking & profiling | Task 9 ~ Task 12 |
Phase III | Documentation | Task 13 ~ Task 14 |
An issue for each task will be created later. Contributions, discussions and comments are all welcome and highly appreciated!
Possible Future Work
- Efficiency Improvement
- Accuracy Improvement
- Low-latency Inference Library
- Large-scale benchmarking
References
1. Dario Amodei et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. ICML 2016.