Opened May 18, 2017 by saxon_zh

Deep Speech 2 on PaddlePaddle: Plan & Task Breakdown

Created by: xinghai-sun

We are planning to build Deep Speech 2 (DS2) [1], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:

  • Release a basic distributed implementation of DS2 on PaddlePaddle.
  • Contribute a Deep Speech chapter to the PaddlePaddle Book.

Intensive system optimization and the low-latency inference library (see [1] for details) are not covered in this first-stage plan.

Tasks

We roughly break down the project into 14 tasks:

  1. Develop an audio data provider:
    • JSON file-list generator.
    • Audio file format transformer.
    • Spectrogram feature extraction, power normalization etc.
    • Batch data reader with SortaGrad.
    • Data augmentation (optional).
    • Prepare (one or more) public English data sets & baseline.
    • https://github.com/PaddlePaddle/Paddle/issues/2226
    • https://github.com/PaddlePaddle/Paddle/issues/2227
  2. Create a simplified DS2 model configuration:
    • With only fixed-length (padded) audio sequences (otherwise Task 3 is needed).
    • With only bidirectional GRU (otherwise Task 4 is needed).
    • With only a greedy decoder (otherwise Tasks 5 and 6 are needed).
    • https://github.com/PaddlePaddle/Paddle/issues/2231
  3. Add support for variable-shaped dense-vector (image) batches of input data:
    • Update DenseScanner in dataprovider_converter.py, etc.
    • https://github.com/PaddlePaddle/Paddle/issues/2198
  4. Develop a new lookahead-row-convolution layer (See [1] for details):
    • Lookahead convolution windows.
    • Within-row convolution, without kernels shared across rows.
    • https://github.com/PaddlePaddle/Paddle/issues/2228
  5. Build a KenLM n-gram language model for beam search decoding:
    • Use the KenLM toolkit: Kneser-Ney smoothed, 5-gram, with pruning, etc.
    • Prepare the corpus & train the model.
    • Create inference interfaces pluggable into CTC beam search (for Task 6).
    • https://github.com/PaddlePaddle/Paddle/issues/2229
  6. Develop a beam search decoder with CTC + LM + WORDCOUNT:
    • Beam search with CTC.
    • Beam search with external custom scorer (e.g. LM).
    • Try to design a more general beam search interface.
    • https://github.com/PaddlePaddle/Paddle/issues/2230
  7. Develop a Word Error Rate evaluator:
    • Update ctc_error_evaluator (CER) to support WER.
  8. Prepare an internal dataset for Mandarin (optional):
    • Dataset, baseline, evaluation details.
    • Mandarin-specific data preprocessing.
    • Might require cooperation with the Department of Speech.
    • https://github.com/PaddlePaddle/Paddle/issues/2232
  9. Create the standard DS2 model configuration:
    • With variable-length audio sequences (requires Task 3).
    • With unidirectional GRU + row convolution (requires Task 4).
    • With CTC-LM beam search decoder (requires Tasks 5 and 6).
  10. Make it run reliably on clusters.
  11. Experiments and benchmarking (for accuracy, not efficiency):
    • With public English dataset.
    • With internal (Baidu) Mandarin dataset (optional).
  12. Time profiling and optimization.
  13. Prepare docs.
  14. Prepare PaddlePaddle Book chapter with a simplified version.
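
The "batch data reader with SortaGrad" in Task 1 could be sketched roughly as follows. This is a minimal illustration based on the description in [1] (the first epoch visits batches in order of increasing utterance length, later epochs visit them in random order); it is not the planned PaddlePaddle reader. The function name `sortagrad_batches` and the `(utterance_id, duration)` tuple layout are assumptions made for this example.

```python
import random


def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Return a list of batches of (utterance_id, duration) pairs.

    Hypothetical helper, not a PaddlePaddle API.
    Epoch 0: batches are ordered by increasing duration (SortaGrad).
    Later epochs: the same length-grouped batches are shuffled.
    """
    # Sort by duration so each batch holds utterances of similar length,
    # which also keeps padding small.
    ordered = sorted(utterances, key=lambda item: item[1])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    if epoch > 0:
        # After the first epoch, shuffle the order of the batches.
        random.Random(seed + epoch).shuffle(batches)
    return batches


utts = [("a", 3.0), ("b", 1.0), ("c", 2.0), ("d", 5.0), ("e", 4.0)]
first_epoch = sortagrad_batches(utts, batch_size=2, epoch=0)
later_epoch = sortagrad_batches(utts, batch_size=2, epoch=1)
```

Shuffling whole batches rather than individual utterances keeps utterances of similar length together in later epochs as well.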
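
For the greedy decoder mentioned in Task 2, a minimal sketch of CTC best-path decoding is shown here: take the most probable label per frame, merge consecutive repeats, then drop blanks. The function name `ctc_greedy_decode` and the convention that the blank label is the last index are assumptions for illustration, not the interface of the planned model configuration.

```python
def ctc_greedy_decode(probs, vocabulary, blank_index=None):
    """Best-path CTC decoding (hypothetical helper).

    probs: list of frames, each a list of per-label probabilities.
    vocabulary: list of output symbols; the blank label is assumed to be
    the extra index len(vocabulary) unless blank_index is given.
    """
    if blank_index is None:
        blank_index = len(vocabulary)
    # Most probable label index for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in probs]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when the label changes and is not blank.
        if index != previous and index != blank_index:
            decoded.append(vocabulary[index])
        previous = index
    return "".join(decoded)


frames = [
    [0.8, 0.1, 0.1],  # a
    [0.6, 0.2, 0.2],  # a (repeat, merged)
    [0.1, 0.1, 0.8],  # blank
    [0.7, 0.2, 0.1],  # a (new symbol after blank)
    [0.1, 0.8, 0.1],  # b
    [0.2, 0.7, 0.1],  # b (repeat, merged)
    [0.1, 0.1, 0.8],  # blank
]
text = ctc_greedy_decode(frames, ["a", "b"])
```

Greedy decoding ignores the language model entirely, which is why the standard configuration in Task 9 depends on the beam search decoder of Tasks 5 and 6.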
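
The lookahead row convolution of Task 4 can be illustrated with a small pure-Python sketch. Following the formulation in [1], each feature row i has its own weights over the current frame and the next τ frames, and weights are not shared across rows. The function name `lookahead_row_conv` and the zero-padding of frames past the end of the sequence are assumptions for this example; the actual layer would be implemented inside PaddlePaddle.

```python
def lookahead_row_conv(h, weights):
    """Lookahead row convolution over a sequence (hypothetical helper).

    h: list of T frames, each a list of d activations.
    weights: list of d rows, each holding tau + 1 coefficients, where
    weights[i][j] multiplies feature i of frame t + j.
    Frames beyond the end of the sequence are treated as zeros.
    """
    num_frames = len(h)
    output = []
    for t in range(num_frames):
        frame = []
        for i, row_weights in enumerate(weights):
            total = 0.0
            for j, w in enumerate(row_weights):
                if t + j < num_frames:  # implicit zero padding otherwise
                    total += w * h[t + j][i]
            frame.append(total)
        output.append(frame)
    return output


h = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]   # T = 3 frames, d = 2
w = [[1.0, 0.5], [0.0, 1.0]]                  # tau = 1 future frame
out = lookahead_row_conv(h, w)
```

Because only a fixed number of future frames is needed, the layer can stand in for the backward direction of a bidirectional GRU in a streaming-friendly model.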
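
The Word Error Rate evaluator of Task 7 boils down to an edit distance over word tokens divided by the number of reference words; the character error rate (CER) is the same computation over characters. The sketch below is a standalone illustration with hypothetical function names, not the ctc_error_evaluator code.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """CER = character-level edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
cer = char_error_rate("abc", "abd")
```

In the example, one substitution ("sat" to "sit") and one deletion ("the") over six reference words give a WER of 2/6.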

Task Dependency

Tasks within the same phase can be carried out in parallel:

| Roadmap | Description | Parallelizable Tasks |
|---|---|---|
| Phase I | Basic model & components | Task 1 ~ Task 8 |
| Phase II | Standard model & benchmarking & profiling | Task 9 ~ Task 12 |
| Phase III | Documentation | Task 13 ~ Task 14 |

An issue for each task will be created later. Contributions, discussions, and comments are all welcome and highly appreciated!

Possible Future Work

  • Efficiency Improvement
  • Accuracy Improvement
  • Low-latency Inference Library
  • Large-scale benchmarking

References

  1. Dario Amodei et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. ICML 2016.
Reference: paddlepaddle/models#44