We are planning to build Deep Speech 2 (DS2) \[[1](#references)\], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:
- Release a basic distributed implementation of DS2 on PaddlePaddle.
- Contribute a chapter of Deep Speech to PaddlePaddle Book.
Intensive system optimization and a low-latency inference library (details in \[[1](#references)\]) are not yet covered in this first-stage plan.
## Table of Contents
- [Tasks](#tasks)
- [Task Dependency](#task-dependency)
- [Design Details](#design-details)
- [Overview](#overview)
- [Row Convolution](#row-convolution)
- [Beam Search With CTC and LM](#beam-search-with-ctc-and-lm)
- [Future Work](#future-work)
- [References](#references)
## Tasks
We roughly break down the project into 14 tasks:
1. Develop an **audio data provider**:
- JSON filelist generator.
- Audio file format transformer.
- Spectrogram feature extraction, power normalization, etc. (a minimal sketch follows this task list).
- Batch data reader with SortaGrad.
- Data augmentation (optional).
- Prepare (one or more) public English datasets & baselines.
2. Create a **simplified DS2 model configuration**:
- With only fixed-length (by padding) audio sequences (otherwise need *Task 3*).
- With only bidirectional-GRU (otherwise need *Task 4*).
- With only greedy decoder (otherwise need *Task 5, 6*).
3. Add support for **variable-shaped** dense-vector (image) batches of input data.
- Update `DenseScanner` in `dataprovider_converter.py`, etc.
4. Develop a new **lookahead-row-convolution layer** (See \[[1](#references)\] for details):
- Lookahead convolution windows.
- Within-row convolution, without kernels shared across rows.
5. Build a KenLM **language model** (5-gram) for the beam search decoder:
- Use the KenLM toolkit.
- Prepare the corpus & train the model.
- Create inference interfaces (for Task 6; a minimal scoring wrapper follows this task list).
6. Develop a **beam search decoder** with CTC + LM + WORDCOUNT:
- Beam search with CTC.
- Beam search with external custom scorer (e.g. LM).
- Try to design a more general beam search interface.
7. Develop a **Word Error Rate evaluator**:
- Update `ctc_error_evaluator` (CER) to support WER (a reference WER computation follows this task list).
8. Prepare an internal dataset for Mandarin (optional):
- Dataset, baseline, evaluation details.
- Particular data preprocessing for Mandarin.
- Might need cooperation with the Speech Department.
9. Create a **standard DS2 model configuration**:
- With variable-length audio sequences (need *Task 3*).
- With unidirectional-GRU + row-convolution (need *Task 4*).
- With CTC-LM beam search decoder (need *Task 5, 6*).
10. Make it run perfectly on **clusters**.
11. Experiments and **benchmarking** (for accuracy, not efficiency):
- With public English dataset.
- With internal (Baidu) Mandarin dataset (optional).
12. Time **profiling** and optimization.
13. Prepare **docs**.
14. Prepare PaddlePaddle **Book** chapter with a simplified version.
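
To make Task 1 more concrete, below is a minimal NumPy-only sketch of spectrogram feature extraction and a SortaGrad batch reader. All function names, parameters, and the manifest format (a list of dicts with a `duration` field) are illustrative assumptions, not the final data-provider API.

```python
import random

import numpy as np


def spectrogram(samples, sample_rate, window_ms=20, stride_ms=10, eps=1e-14):
    """Log-power linear spectrogram via a short-time FFT."""
    window = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    num_frames = 1 + (len(samples) - window) // stride
    frames = np.stack([samples[i * stride:i * stride + window]
                       for i in range(num_frames)])
    frames = frames * np.hanning(window)               # taper each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)                         # shape: (num_frames, num_bins)


def sortagrad_batches(manifest, batch_size, first_epoch):
    """Yield mini-batches; sort utterances by duration in the first epoch (SortaGrad)."""
    if first_epoch:
        items = sorted(manifest, key=lambda item: item["duration"])
    else:
        items = random.sample(manifest, len(manifest))  # shuffle in later epochs
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```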
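
For the inference interface in Task 5, the trained 5-gram model can be queried from Python through the KenLM bindings. The wrapper class and the `alpha` weight below are illustrative assumptions about how the decoder might consume the scorer; only `kenlm.Model` and its `score()` method come from the KenLM toolkit itself.

```python
import kenlm  # Python bindings of the KenLM toolkit (https://github.com/kpu/kenlm)


class LmScorer(object):
    """Thin wrapper exposing a sentence-scoring interface to the decoder (Task 6)."""

    def __init__(self, lm_path, alpha=1.0):
        self._model = kenlm.Model(lm_path)  # ARPA or binary 5-gram model
        self._alpha = alpha                 # LM weight, to be tuned against WER/CER

    def __call__(self, sentence):
        # KenLM returns the log10 probability of the whole sentence.
        return self._alpha * self._model.score(sentence, bos=True, eos=True)
```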
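
As a reference for Task 7, WER is the word-level edit distance between the decoding result and the transcription, normalized by the number of reference words. A plain-Python sketch, independent of the evaluator API to be updated:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitute, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return float(dist[len(ref)][len(hyp)]) / max(len(ref), 1)
```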
## Task Dependency

Phase | Description | Tasks
----- | ----------- | -----
Phase I | Simplified model & components | *Task 1* ~ *Task 8*
Phase II | Standard model & benchmarking & profiling | *Task 9* ~ *Task 12*
Phase III | Documentation | *Task 13* ~ *Task 14*
An issue will be created for each task later. Contributions, discussions, and comments are all highly appreciated and welcome!
## Design Details
### Overview
Traditional **ASR** (Automatic Speech Recognition) pipelines require great human effort devoted to elaborately tuning multiple hand-engineered components (e.g. audio feature design, acoustic model, pronunciation model, language model, etc.). **Deep Speech 2** (**DS2**) \[[1](#references)\], however, trains such ASR models in an end-to-end manner, replacing most intermediate modules with a single deep network architecture. By scaling up both the data and model sizes, DS2 achieves a very significant performance boost.
Please read the Deep Speech 2 papers \[[1](#references), [2](#references)\] for more background knowledge.
The classical DS2 network contains 15 layers (from bottom to top):
- **Two** data layers (audio spectrogram, transcription text)
- **Three** 2D convolution layers
- **Seven** uni-directional simple-RNN layers
- **One** lookahead row-convolution layer
- **One** fully-connected layer
- **One** CTC-loss layer
<div align="center">
<img src="image/ds2_network.png" width=350><br/>
Figure 1. Architecture of the Deep Speech 2 Network.
</div>
We don't have to stick to this 2-3-7-1-1-1 depth \[[2](#references)\]. Similar networks with different depths might also work well. As in \[[1](#references)\], the authors use a different depth (e.g. 2-2-3-1-1-1) for the final experiments.
Key ingredients of the layers:
- **Data Layers**:
- Frame sequences of audio **spectrogram** features (computed with FFT).
- Token sequences of **transcription** text (labels).
- These two types of sequences do not have the same lengths; thus a CTC-loss layer is required.
- **2D Convolution Layers**:
- Not only temporal convolution, but also **frequency convolution**. Like a 2D image convolution, but with one variable-length dimension (the temporal dimension).
- With striding for only the first convolution layer.
- No pooling for all convolution layers.
- **Uni-directional RNNs**:
- Uni-directional + row convolution: for low-latency inference.
- Bi-directional + without row convolution: if we don't care about inference latency.
- **Row convolution**:
- Looks only a few steps ahead into future features, instead of over the whole sequence as in bi-directional RNNs (an illustrative sketch follows this list).
- Not necessary with bi-directional RNNs.
- "**Row**" means convolutions are done within each frequency dimension (row), and no convolution kernels are shared across rows.
- **Batch Normalization Layers**:
- Added to all the above layers (except for the data and loss layers).
- Sequence-wise normalization for RNNs: BatchNorm is performed only on the input-state projection and not on the state-state projection, for efficiency.
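
To illustrate the row convolution described above, here is a NumPy sketch under simplifying assumptions (zero-padding at the sequence end, plain loops instead of an optimized kernel); the actual layer design is left to the assignees of Task 4.

```python
import numpy as np


def lookahead_row_conv(activations, weights):
    """Lookahead row convolution over recurrent-layer activations.

    activations : (T, D) outputs of the topmost uni-directional recurrent layer.
    weights     : (D, tau + 1) one independent kernel per feature row, covering
                  the current step plus tau future steps (no sharing across rows).
    """
    T, D = activations.shape
    tau = weights.shape[1] - 1
    padded = np.vstack([activations, np.zeros((tau, D))])  # zero-pad the future edge
    output = np.zeros_like(activations, dtype=float)
    for t in range(T):
        # Per-row weighted sum over the (tau + 1)-step future context window.
        output[t] = np.sum(weights * padded[t:t + tau + 1].T, axis=1)
    return output
```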
Required Components | PaddlePaddle Support | Need to Develop
------------------- | -------------------- | ---------------
CTC-Beam search | Not supported yet. | TBD (Task 6)
### Row Convolution
TODO by Assignees
### Beam Search with CTC and LM
<div align="center">
<img src="image/beam_search.png" width=600><br/>
Figure 2. Algorithm for CTC Beam Search Decoder.
</div>
- The **Beam Search Decoder** for the DS2 CTC-trained network follows an approach similar to \[[3](#references)\], as shown in Figure 2, with two important modifications to the ambiguous parts:
- 1) in the iterative computation of probabilities, the assignment operation is changed to accumulation, because one prefix may come from different paths;
- 2) the condition ```if l^+ not in A_prev then``` after the probability computation is dropped, since it is hard to understand and seems unnecessary.
- An **external scorer** is passed into the decoder to evaluate a candidate prefix during decoding, whenever a whitespace character is appended in English decoding or any character is appended in Mandarin decoding.
- Such an external scorer consists of a language model, a word count, or any other custom scorers.
- The **language model** is built in Task 5, and its parameters should be carefully tuned to achieve the minimum WER/CER (cf. Task 7).
- This decoder needs to run with **high efficiency**, for convenient parameter tuning and real-world speech recognition (a simplified sketch of the decoder follows this list).
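
To make the above concrete, here is a simplified Python sketch of the prefix beam search with an external scorer hook. It keeps probabilities in the linear domain, omits pruning thresholds and length normalization, and all names (e.g. `ext_scorer`) are illustrative rather than the final decoder interface; the two modifications above are marked in the comments.

```python
from collections import defaultdict


def ctc_beam_search(probs, vocabulary, beam_size=200, blank=0, ext_scorer=None):
    """Prefix beam search over the CTC posteriors of one utterance.

    probs      : (T, V) per-frame posterior probabilities, blank at column `blank`.
    vocabulary : list mapping the non-blank columns, in order, to characters.
    ext_scorer : optional callable(prefix) -> multiplicative score, applied here
                 when a space is appended (for Mandarin: on every character).
    """
    # Each prefix keeps (probability ending in blank, probability ending in non-blank).
    beam = {'': (1.0, 0.0)}
    for t in range(probs.shape[0]):
        new_beam = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beam.items():
            # Case 1: emit a blank -- the prefix is unchanged.
            b, nb = new_beam[prefix]
            new_beam[prefix] = (b + probs[t, blank] * (p_b + p_nb), nb)
            # Case 2: emit a non-blank character.
            for i, ch in enumerate(vocabulary):
                col = i + 1 if blank == 0 else i
                p = probs[t, col]
                extended = prefix + ch
                if prefix and ch == prefix[-1]:
                    # Repeated character: only a path through a blank yields a new symbol...
                    b, nb = new_beam[extended]
                    new_beam[extended] = (b, nb + p * p_b)
                    # ...otherwise the repeat collapses into the same prefix.
                    b, nb = new_beam[prefix]
                    new_beam[prefix] = (b, nb + p * p_nb)
                else:
                    score = p * (p_b + p_nb)
                    if ext_scorer is not None and ch == ' ':
                        score *= ext_scorer(extended)   # LM + word-count scoring hook
                    # Modification 1: accumulate, since a prefix may come from many paths.
                    b, nb = new_beam[extended]
                    new_beam[extended] = (b, nb + score)
        # Prune to the `beam_size` most probable prefixes. Modification 2: no
        # special-casing of prefixes absent from the previous beam.
        beam = dict(sorted(new_beam.items(),
                           key=lambda kv: kv[1][0] + kv[1][1],
                           reverse=True)[:beam_size])
    return max(beam.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
```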
## Future Work
- Efficiency Improvement
- Accuracy Improvement
- Low-latency Inference Library
- Large-scale benchmarking
## References
1. Dario Amodei, et al., [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://proceedings.mlr.press/v48/amodei16.pdf). ICML 2016.
2. Dario Amodei, et al., [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595). arXiv:1512.02595.
3. Awni Y. Hannun, et al., [First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs](https://arxiv.org/abs/1408.2873). arXiv:1408.2873.
<spanid="deepspeech2-on-paddlepaddle-design-doc"></span><h1>DeepSpeech2 on PaddlePaddle: Design Doc<aclass="headerlink"href="#deepspeech2-on-paddlepaddle-design-doc"title="Permalink to this headline">¶</a></h1>
<p>We are planning to build Deep Speech 2 (DS2) [<aclass="reference external"href="#references">1</a>], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:</p>
<ulclass="simple">
<li>Release a basic distributed implementation of DS2 on PaddlePaddle.</li>
<li>Contribute a chapter of Deep Speech to PaddlePaddle Book.</li>
</ul>
<p>Intensive system optimization and low-latency inference library (details in [<aclass="reference external"href="#references">1</a>]) are not yet covered in this first-stage plan.</p>
<divclass="section"id="table-of-contents">
<spanid="table-of-contents"></span><h2>Table of Contents<aclass="headerlink"href="#table-of-contents"title="Permalink to this headline">¶</a></h2>
Phase I | Simplified model & components | <em>Task 1</em> ~ <em>Task 8</em>
Phase II | Standard model & benchmarking & profiling | <em>Task 9</em> ~ <em>Task 12</em>
Phase III | Documentations | <em>Task13</em> ~ <em>Task14</em></p>
<p>Issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcomed!</p>
</div>
<divclass="section"id="design-details">
<spanid="design-details"></span><h2>Design Details<aclass="headerlink"href="#design-details"title="Permalink to this headline">¶</a></h2>
<divclass="section"id="overview">
<spanid="overview"></span><h3>Overview<aclass="headerlink"href="#overview"title="Permalink to this headline">¶</a></h3>
<p>Traditional <strong>ASR</strong> (Automatic Speech Recognition) pipelines require great human efforts devoted to elaborately tuning multiple hand-engineered components (e.g. audio feature design, accoustic model, pronuncation model and language model etc.). <strong>Deep Speech 2</strong> (<strong>DS2</strong>) [<aclass="reference external"href="#references">1</a>], however, trains such ASR models in an end-to-end manner, replacing most intermediate modules with only a single deep network architecture. With scaling up both the data and model sizes, DS2 achieves a very significant performance boost.</p>
<p>Please read Deep Speech 2 [<aclass="reference external"href="#references">1</a>,<aclass="reference external"href="#references">2</a>] paper for more background knowledge.</p>
</div><p>We don’t have to persist on this 2-3-7-1-1-1 depth [<aclass="reference external"href="#references">2</a>]. Similar networks with different depths might also work well. As in [<aclass="reference external"href="#references">1</a>], authors use a different depth (e.g. 2-2-3-1-1-1) for final experiments.</p>
<p>Key ingredients about the layers:</p>
<ulclass="simple">
<li><strong>Data Layers</strong>:<ul>
<li>Frame sequences data of audio <strong>spectrogram</strong> (with FFT).</li>
<li>Token sequences data of <strong>transcription</strong> text (labels).</li>
<li>These two type of sequences do not have the same lengthes, thus a CTC-loss layer is required.</li>
</ul>
</li>
<li><strong>2D Convolution Layers</strong>:<ul>
<li>Not only temporal convolution, but also <strong>frequency convolution</strong>. Like a 2D image convolution, but with a variable dimension (i.e. temporal dimension).</li>
<li>With striding for only the first convlution layer.</li>
<li>No pooling for all convolution layers.</li>
</ul>
</li>
<li><strong>Uni-directional RNNs</strong><ul>
<li>Uni-directional + row convolution: for low-latency inference.</li>
<li>Bi-direcitional + without row convolution: if we don’t care about the inference latency.</li>
</ul>
</li>
<li><strong>Row convolution</strong>:<ul>
<li>For looking only a few steps ahead into the feature, instead of looking into a whole sequence in bi-directional RNNs.</li>
<li>Not nessesary if with bi-direcitional RNNs.</li>
<li>“<strong>Row</strong>” means convolutions are done within each frequency dimension (row), and no convolution kernels shared across.</li>
<li>Added to all above layers (except for data and loss layer).</li>
<li>Sequence-wise normalization for RNNs: BatchNorm only performed on input-state projection and not state-state projection, for efficiency consideration.</li>
</ul>
</li>
</ul>
<p>Required Components | PaddlePaddle Support | Need to Develop
<spanid="beam-search-with-ctc-and-lm"></span><h3>Beam Search with CTC and LM<aclass="headerlink"href="#beam-search-with-ctc-and-lm"title="Permalink to this headline">¶</a></h3>
<divalign="center">
<imgsrc="image/beam_search.png"width=600><br/>
Figure 2. Algorithm for CTC Beam Search Decoder.
</div><ulclass="simple">
<li>The <strong>Beam Search Decoder</strong> for DS2 CTC-trained network follows the similar approach in [<aclass="reference external"href="#references">3</a>] as shown in Figure 2, with two important modifications for the ambiguous parts:<ul>
<li><olclass="first">
<li>in the iterative computation of probabilities, the assignment operation is changed to accumulation for one prefix may comes from different paths;</li>
</ol>
</li>
<li><olclass="first">
<li>the if condition <codeclass="docutils literal"><spanclass="pre">if</span><spanclass="pre">l^+</span><spanclass="pre">not</span><spanclass="pre">in</span><spanclass="pre">A_prev</span><spanclass="pre">then</span></code> after probabilities’ computation is deprecated for it is hard to understand and seems unnecessary.</li>
</ol>
</li>
</ul>
</li>
<li>An <strong>external scorer</strong> would be passed into the decoder to evaluate a candidate prefix during decoding whenever a white space appended in English decoding and any character appended in Mandarin decoding.</li>
<li>Such external scorer consists of language model, word count or any other custom scorers.</li>
<li>The <strong>language model</strong> is built from Task 5, with parameters should be carefully tuned to achieve minimum WER/CER (c.f. Task 7)</li>
<li>This decoder needs to perform with <strong>high efficiency</strong> for the convenience of parameters tuning and speech recognition in reality.</li>
</ul>
</div>
</div>
<divclass="section"id="future-work">
<spanid="future-work"></span><h2>Future Work<aclass="headerlink"href="#future-work"title="Permalink to this headline">¶</a></h2>
<ulclass="simple">
<li>Efficiency Improvement</li>
<li>Accuracy Improvement</li>
<li>Low-latency Inference Library</li>
<li>Large-scale benchmarking</li>
</ul>
</div>
<divclass="section"id="references">
<spanid="references"></span><h2>References<aclass="headerlink"href="#references"title="Permalink to this headline">¶</a></h2>
<olclass="simple">
<li>Dario Amodei, etc., <aclass="reference external"href="http://proceedings.mlr.press/v48/amodei16.pdf">Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin</a>. ICML 2016.</li>
<li>Dario Amodei, etc., <aclass="reference external"href="https://arxiv.org/abs/1512.02595">Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin</a>. arXiv:1512.02595.</li>
<li>Awni Y. Hannun, etc. <aclass="reference external"href="https://arxiv.org/abs/1408.2873">First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs</a>. arXiv:1408.2873</li>
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.
We are planning to build Deep Speech 2 (DS2) \[[1](#references)\], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:
- Release a basic distributed implementation of DS2 on PaddlePaddle.
- Contribute a chapter of Deep Speech to PaddlePaddle Book.
Intensive system optimization and low-latency inference library (details in \[[1](#references)\]) are not yet covered in this first-stage plan.
## Table of Contents
- [Tasks](#tasks)
- [Task Dependency](#task-dependency)
- [Design Details](#design-details)
- [Overview](#overview)
- [Row Convolution](#row-convolution)
- [Beam Search With CTC and LM](#beam-search-with-ctc-and-lm)
- [Future Work](#future-work)
- [References](#references)
## Tasks
We roughly break down the project into 14 tasks:
1. Develop an **audio data provider**:
- Json filelist generator.
- Audio file format transformer.
- Spectrogram feature extraction, power normalization etc.
- Batch data reader with SortaGrad.
- Data augmentation (optional).
- Prepare (one or more) public English data sets & baseline.
2. Create a **simplified DS2 model configuration**:
- With only fixed-length (by padding) audio sequences (otherwise need *Task 3*).
- With only bidirectional-GRU (otherwise need *Task 4*).
- With only greedy decoder (otherwise need *Task 5, 6*).
3. Develop to support **variable-shaped** dense-vector (image) batches of input data.
- Update `DenseScanner` in `dataprovider_converter.py`, etc.
4. Develop a new **lookahead-row-convolution layer** (See \[[1](#references)\] for details):
- Lookahead convolution windows.
- Within-row convolution, without kernels shared across rows.
5. Build KenLM **language model** (5-gram) for beam search decoder:
- Use KenLM toolkit.
- Prepare the corpus & train the model.
- Create infererence interfaces (for Task 6).
6. Develop a **beam search decoder** with CTC + LM + WORDCOUNT:
- Beam search with CTC.
- Beam search with external custom scorer (e.g. LM).
- Try to design a more general beam search interface.
7. Develop a **Word Error Rate evaluator**:
- update `ctc_error_evaluator`(CER) to support WER.
8. Prepare internal dataset for Mandarin (optional):
- Dataset, baseline, evaluation details.
- Particular data preprocessing for Mandarin.
- Might need cooperating with the Speech Department.
9. Create **standard DS2 model configuration**:
- With variable-length audio sequences (need *Task 3*).
- With unidirectional-GRU + row-convolution (need *Task 4*).
- With CTC-LM beam search decoder (need *Task 5, 6*).
10. Make it run perfectly on **clusters**.
11. Experiments and **benchmarking** (for accuracy, not efficiency):
- With public English dataset.
- With internal (Baidu) Mandarin dataset (optional).
12. Time **profiling** and optimization.
13. Prepare **docs**.
14. Prepare PaddlePaddle **Book** chapter with a simplified version.
Phase I | Simplified model & components | *Task 1* ~ *Task 8*
Phase II | Standard model & benchmarking & profiling | *Task 9* ~ *Task 12*
Phase III | Documentations | *Task13* ~ *Task14*
Issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcomed!
## Design Details
### Overview
Traditional **ASR** (Automatic Speech Recognition) pipelines require great human efforts devoted to elaborately tuning multiple hand-engineered components (e.g. audio feature design, accoustic model, pronuncation model and language model etc.). **Deep Speech 2** (**DS2**) \[[1](#references)\], however, trains such ASR models in an end-to-end manner, replacing most intermediate modules with only a single deep network architecture. With scaling up both the data and model sizes, DS2 achieves a very significant performance boost.
Please read Deep Speech 2 \[[1](#references),[2](#references)\] paper for more background knowledge.
The classical DS2 network contains 15 layers (from bottom to top):
- **Two** data layers (audio spectrogram, transcription text)
- **Three** 2D convolution layers
- **Seven** uni-directional simple-RNN layers
- **One** lookahead row convolution layers
- **One** fully-connected layers
- **One** CTC-loss layer
<div align="center">
<img src="image/ds2_network.png" width=350><br/>
Figure 1. Archetecture of Deep Speech 2 Network.
</div>
We don't have to persist on this 2-3-7-1-1-1 depth \[[2](#references)\]. Similar networks with different depths might also work well. As in \[[1](#references)\], authors use a different depth (e.g. 2-2-3-1-1-1) for final experiments.
Key ingredients about the layers:
- **Data Layers**:
- Frame sequences data of audio **spectrogram** (with FFT).
- Token sequences data of **transcription** text (labels).
- These two type of sequences do not have the same lengthes, thus a CTC-loss layer is required.
- **2D Convolution Layers**:
- Not only temporal convolution, but also **frequency convolution**. Like a 2D image convolution, but with a variable dimension (i.e. temporal dimension).
- With striding for only the first convlution layer.
- No pooling for all convolution layers.
- **Uni-directional RNNs**
- Uni-directional + row convolution: for low-latency inference.
- Bi-direcitional + without row convolution: if we don't care about the inference latency.
- **Row convolution**:
- For looking only a few steps ahead into the feature, instead of looking into a whole sequence in bi-directional RNNs.
- Not nessesary if with bi-direcitional RNNs.
- "**Row**" means convolutions are done within each frequency dimension (row), and no convolution kernels shared across.
- **Batch Normalization Layers**:
- Added to all above layers (except for data and loss layer).
- Sequence-wise normalization for RNNs: BatchNorm only performed on input-state projection and not state-state projection, for efficiency consideration.
Required Components | PaddlePaddle Support | Need to Develop
CTC-Beam search | Not supported yet. | TBD (Task 6)
### Row Convolution
TODO by Assignees
### Beam Search with CTC and LM
<div align="center">
<img src="image/beam_search.png" width=600><br/>
Figure 2. Algorithm for CTC Beam Search Decoder.
</div>
- The **Beam Search Decoder** for DS2 CTC-trained network follows the similar approach in \[[3](#references)\] as shown in Figure 2, with two important modifications for the ambiguous parts:
- 1) in the iterative computation of probabilities, the assignment operation is changed to accumulation for one prefix may comes from different paths;
- 2) the if condition ```if l^+ not in A_prev then``` after probabilities' computation is deprecated for it is hard to understand and seems unnecessary.
- An **external scorer** would be passed into the decoder to evaluate a candidate prefix during decoding whenever a white space appended in English decoding and any character appended in Mandarin decoding.
- Such external scorer consists of language model, word count or any other custom scorers.
- The **language model** is built from Task 5, with parameters should be carefully tuned to achieve minimum WER/CER (c.f. Task 7)
- This decoder needs to perform with **high efficiency** for the convenience of parameters tuning and speech recognition in reality.
## Future Work
- Efficiency Improvement
- Accuracy Improvement
- Low-latency Inference Library
- Large-scale benchmarking
## References
1. Dario Amodei, etc., [Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin](http://proceedings.mlr.press/v48/amodei16.pdf). ICML 2016.
2. Dario Amodei, etc., [Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595). arXiv:1512.02595.
3. Awni Y. Hannun, etc. [First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs](https://arxiv.org/abs/1408.2873). arXiv:1408.2873
<spanid="deepspeech2-on-paddlepaddle-design-doc"></span><h1>DeepSpeech2 on PaddlePaddle: Design Doc<aclass="headerlink"href="#deepspeech2-on-paddlepaddle-design-doc"title="永久链接至标题">¶</a></h1>
<p>We are planning to build Deep Speech 2 (DS2) [<aclass="reference external"href="#references">1</a>], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:</p>
<ulclass="simple">
<li>Release a basic distributed implementation of DS2 on PaddlePaddle.</li>
<li>Contribute a chapter of Deep Speech to PaddlePaddle Book.</li>
</ul>
<p>Intensive system optimization and low-latency inference library (details in [<aclass="reference external"href="#references">1</a>]) are not yet covered in this first-stage plan.</p>
<divclass="section"id="table-of-contents">
<spanid="table-of-contents"></span><h2>Table of Contents<aclass="headerlink"href="#table-of-contents"title="永久链接至标题">¶</a></h2>
<p>Traditional <strong>ASR</strong> (Automatic Speech Recognition) pipelines require great human efforts devoted to elaborately tuning multiple hand-engineered components (e.g. audio feature design, accoustic model, pronuncation model and language model etc.). <strong>Deep Speech 2</strong> (<strong>DS2</strong>) [<aclass="reference external"href="#references">1</a>], however, trains such ASR models in an end-to-end manner, replacing most intermediate modules with only a single deep network architecture. With scaling up both the data and model sizes, DS2 achieves a very significant performance boost.</p>
<p>Please read Deep Speech 2 [<aclass="reference external"href="#references">1</a>,<aclass="reference external"href="#references">2</a>] paper for more background knowledge.</p>
</div><p>We don’t have to persist on this 2-3-7-1-1-1 depth [<aclass="reference external"href="#references">2</a>]. Similar networks with different depths might also work well. As in [<aclass="reference external"href="#references">1</a>], authors use a different depth (e.g. 2-2-3-1-1-1) for final experiments.</p>
<p>Key ingredients about the layers:</p>
<ulclass="simple">
<li><strong>Data Layers</strong>:<ul>
<li>Frame sequences data of audio <strong>spectrogram</strong> (with FFT).</li>
<li>Token sequences data of <strong>transcription</strong> text (labels).</li>
<li>These two type of sequences do not have the same lengthes, thus a CTC-loss layer is required.</li>
</ul>
</li>
<li><strong>2D Convolution Layers</strong>:<ul>
<li>Not only temporal convolution, but also <strong>frequency convolution</strong>. Like a 2D image convolution, but with a variable dimension (i.e. temporal dimension).</li>
<li>With striding for only the first convlution layer.</li>
<li>No pooling for all convolution layers.</li>
</ul>
</li>
<li><strong>Uni-directional RNNs</strong><ul>
<li>Uni-directional + row convolution: for low-latency inference.</li>
<li>Bi-direcitional + without row convolution: if we don’t care about the inference latency.</li>
</ul>
</li>
<li><strong>Row convolution</strong>:<ul>
<li>For looking only a few steps ahead into the feature, instead of looking into a whole sequence in bi-directional RNNs.</li>
<li>Not nessesary if with bi-direcitional RNNs.</li>
<li>“<strong>Row</strong>” means convolutions are done within each frequency dimension (row), and no convolution kernels shared across.</li>
<li>Added to all above layers (except for data and loss layer).</li>
<li>Sequence-wise normalization for RNNs: BatchNorm only performed on input-state projection and not state-state projection, for efficiency consideration.</li>
</ul>
</li>
</ul>
<p>Required Components | PaddlePaddle Support | Need to Develop
<spanid="beam-search-with-ctc-and-lm"></span><h3>Beam Search with CTC and LM<aclass="headerlink"href="#beam-search-with-ctc-and-lm"title="永久链接至标题">¶</a></h3>
<divalign="center">
<imgsrc="image/beam_search.png"width=600><br/>
Figure 2. Algorithm for CTC Beam Search Decoder.
</div><ulclass="simple">
<li>The <strong>Beam Search Decoder</strong> for DS2 CTC-trained network follows the similar approach in [<aclass="reference external"href="#references">3</a>] as shown in Figure 2, with two important modifications for the ambiguous parts:<ul>
<li><olclass="first">
<li>in the iterative computation of probabilities, the assignment operation is changed to accumulation for one prefix may comes from different paths;</li>
</ol>
</li>
<li><olclass="first">
<li>the if condition <codeclass="docutils literal"><spanclass="pre">if</span><spanclass="pre">l^+</span><spanclass="pre">not</span><spanclass="pre">in</span><spanclass="pre">A_prev</span><spanclass="pre">then</span></code> after probabilities’ computation is deprecated for it is hard to understand and seems unnecessary.</li>
</ol>
</li>
</ul>
</li>
<li>An <strong>external scorer</strong> would be passed into the decoder to evaluate a candidate prefix during decoding whenever a white space appended in English decoding and any character appended in Mandarin decoding.</li>
<li>Such external scorer consists of language model, word count or any other custom scorers.</li>
<li>The <strong>language model</strong> is built from Task 5, with parameters should be carefully tuned to achieve minimum WER/CER (c.f. Task 7)</li>
<li>This decoder needs to perform with <strong>high efficiency</strong> for the convenience of parameters tuning and speech recognition in reality.</li>
<li>Dario Amodei, etc., <aclass="reference external"href="http://proceedings.mlr.press/v48/amodei16.pdf">Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin</a>. ICML 2016.</li>
<li>Dario Amodei, etc., <aclass="reference external"href="https://arxiv.org/abs/1512.02595">Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin</a>. arXiv:1512.02595.</li>
<li>Awni Y. Hannun, etc. <aclass="reference external"href="https://arxiv.org/abs/1408.2873">First-Pass Large Vocabulary Continuous Speech Recognition using Bi-Directional Recurrent DNNs</a>. arXiv:1408.2873</li>
Built with <ahref="http://sphinx-doc.org/">Sphinx</a> using a <ahref="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <ahref="https://readthedocs.org">Read the Docs</a>.