From 21b098f2e486bcb762c2925c4f890cb74bea11b6 Mon Sep 17 00:00:00 2001
From: Jackwaterveg <87408988+Jackwaterveg@users.noreply.github.com>
Date: Thu, 9 Sep 2021 17:35:47 +0800
Subject: [PATCH] Update deepspeech_architecture.md

---
 doc/src/deepspeech_architecture.md | 42 ++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 11 deletions(-)

diff --git a/doc/src/deepspeech_architecture.md b/doc/src/deepspeech_architecture.md
index d68ad469..8b281ff8 100644
--- a/doc/src/deepspeech_architecture.md
+++ b/doc/src/deepspeech_architecture.md
@@ -66,10 +66,19 @@ python3 ../../../utils/compute_mean_std.py \
 ### Encoder
 The backbone is composed of two 2D convolution subsampling layers and a number of stacked single-direction rnn layers. The 2D convolution subsampling layers extract a feature representation from the raw audio feature and reduce its length at the same time. After passing through the convolution subsampling layers, the feature representation is fed into the stacked rnn layers. For the rnn layers, both LSTM cells and GRU cells are provided. Adding one fully connected (fc) layer after the rnn layers is optional; if the number of rnn layers is less than 5, adding one fc layer after them is recommended.
-
+The code of the encoder is in:
+```
+vi deepspeech/models/ds2_online/deepspeech2.py
+```
+
 ### Decoder
 To get the character probabilities of each frame, the frame-level feature representation output by the backbone is fed into a projection layer, implemented as a dense layer whose output dimension equals the vocabulary size. After the projection layer, the softmax function turns each frame-level feature representation into a probability distribution over characters. During inference, the character probabilities of each frame are passed to the CTC decoder to get the final speech recognition result.
-
+The code of the decoder is in:
+```
+vi deepspeech/models/ds2_online/deepspeech2.py
+vi deepspeech/modules/ctc.py
+```
+
 ## Training Process
 Using the command below, you can train the deepspeech2 online model.
 ```
@@ -143,25 +152,36 @@ After the training process, we use stage 3,4,5 for testing process. The stage 3
 ## No Streaming
-The deepspeech2 offline model is similar to the deepspeech2 online model. The main difference between them is that the offline model uses bi-directional rnn layers while the online model uses single-direction rnn layers. The architecture of the model is shown in Fig.2.
+The deepspeech2 offline model is similar to the deepspeech2 online model. The main difference between them is that the offline model uses bi-directional rnn layers while the online model uses single-direction rnn layers, and the fc layer is not used.
+
+The architecture of the model is shown in Fig.2.
 <div align=center>
 Fig.2 The Architecture of deepspeech2 offline model
 </div>
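
To make the data flow that the updated Encoder/Decoder paragraphs describe easier to follow, here is a minimal, self-contained sketch written against the public `paddle.nn` API. It is not the implementation in `deepspeech/models/ds2_online/deepspeech2.py`: the class name `TinyDS2Online` is invented for illustration, all layer sizes (32 conv channels, 1024 hidden units, a 4233-entry vocabulary) are placeholder assumptions, and CTC decoding is omitted.

```python
# Illustrative sketch only -- NOT the code in
# deepspeech/models/ds2_online/deepspeech2.py. All sizes are placeholders.
import paddle
import paddle.nn as nn
import paddle.nn.functional as F


def subsampled_len(n, kernel=3, stride=2, padding=1):
    """Output length of one Conv2D axis after stride-2 subsampling."""
    return (n + 2 * padding - kernel) // stride + 1


class TinyDS2Online(nn.Layer):
    """Encoder (conv subsampling + single-direction rnn + optional fc)
    followed by the projection/softmax decoder head."""

    def __init__(self, feat_dim=80, hidden_size=1024, num_rnn_layers=3,
                 vocab_size=4233):
        super().__init__()
        # Two 2D convolution subsampling layers: extract a feature
        # representation and cut the time axis roughly 4x (stride 2, twice).
        self.conv1 = nn.Conv2D(1, 32, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2D(32, 32, kernel_size=3, stride=2, padding=1)
        rnn_in = 32 * subsampled_len(subsampled_len(feat_dim))
        # Stacked single-direction rnn layers (LSTM here; GRU also fits).
        self.rnn = nn.LSTM(rnn_in, hidden_size, num_layers=num_rnn_layers,
                           direction='forward')
        # Optional fc layer, recommended when num_rnn_layers < 5.
        self.fc = nn.Linear(hidden_size, hidden_size)
        # Projection layer: output dimension equals the vocabulary size.
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, feats):
        x = feats.unsqueeze(1)                 # (B, T, D) -> (B, 1, T, D)
        x = F.relu(self.conv1(x))              # (B, 32, T/2, D/2)
        x = F.relu(self.conv2(x))              # (B, 32, T/4, D/4)
        b, c, t, d = x.shape
        # Flatten channel and feature axes into one rnn input per frame.
        x = x.transpose([0, 2, 1, 3]).reshape([b, t, c * d])
        x, _ = self.rnn(x)                     # (B, T/4, hidden_size)
        x = F.relu(self.fc(x))
        logits = self.proj(x)                  # (B, T/4, vocab_size)
        # Per-frame character probabilities; a CTC decoder would consume these.
        return F.softmax(logits, axis=-1)


probs = TinyDS2Online()(paddle.randn([4, 100, 80]))
print(probs.shape)  # [4, 25, 4233]
```

For the offline (No Streaming) variant described above, the corresponding change in this sketch would be `direction='bidirect'` in the LSTM and dropping the fc layer.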