# 1. Introduction
## 1.1 Overview
PP-ASR is a toolkit for automatic speech recognition (ASR). It provides a variety of Chinese and English models, supports model training, and supports model inference from the command line. PP-ASR also supports the deployment of streaming models and deployment for personalized scenarios. The available pre-trained models are listed in [released_model](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md). The Conformer model is the stronger choice and supports streaming ASR.

## 1.2 Features
The basic process of speech recognition is shown in the following figure: 
<div align=center>
<img src="https://user-images.githubusercontent.com/87408988/168259962-cbe2008b-47b6-443d-9566-d77a5ca2eb25.png"/>
<br>
</div>
<br></br>
The main features of PP-ASR are as follows:

1. Pre-trained models on open-source Chinese and English datasets: aishell1 (Chinese), wenetspeech (Chinese), and librispeech (English). The models include the deepspeech2 model and the conformer/transformer models.
2. Training of Chinese and English models.
3. Command-line inference: run `paddlespeech asr --model xxx --input xxx.wav` to call any pre-trained model.
4. Service deployment of streaming ASR, including timestamp output.
5. Deployment for personalized scenarios.

See [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech) for more.


# 2. Model Representation and Application Scenarios
## 2.1 Streaming Speech Recognition Task

Automatic Speech Recognition (ASR) is the task of extracting the text content from a piece of speech. In streaming speech recognition, the user splits the speech into segments, feeds them to the recognizer one after another, and obtains the recognition result at the end.

A real-time speech recognition engine extracts features from each incoming segment and decodes it right away, without waiting for the complete audio. After the last segment arrives, the final result is therefore returned after only a short delay (the time needed to process that last segment). Streaming input shortens the overall time to the final result and greatly improves the user experience.
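
A minimal sketch of the streaming input mode described above, assuming 16 kHz audio held as a list of samples. The function name `iter_chunks` and the 400 ms chunk duration are illustrative assumptions, not part of PaddleSpeech:

```python
def iter_chunks(samples, sample_rate=16000, chunk_ms=400):
    """Yield consecutive fixed-duration chunks of a sample sequence.

    The last chunk may be shorter than the others.
    """
    chunk_len = sample_rate * chunk_ms // 1000
    if chunk_len <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# One second of silence split into 400 ms chunks (hypothetical input).
audio = [0] * 16000
chunk_sizes = [len(chunk) for chunk in iter_chunks(audio)]
```

A streaming recognizer would consume each chunk as soon as it is produced, so only the last chunk remains to be processed when the speaker stops.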

## 2.2 Application Scenario
1. Human–Computer Interaction/Speech Input    
Streaming speech recognition generates text in real time while the user speaks, which shortens the machine's response time and improves the user experience.


<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/6a68196417234818b3241616a1649741eef4f919c67141d9b9ad371780d110a8" height=50%, width=50%/>
<br>
  (Baidu smart audio: https://dumall.baidu.com/)
</div>

  
2. Real-Time Subtitles/Meeting Minutes    
Speech is transcribed while people are speaking. Real-time speech recognition services convert the audio of meetings, court trials, interviews, and similar scenes into text, reducing manual recording costs and improving efficiency.

<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/546271f5bad341acb208d3d497874028da5a664e9e1e460eb61af6a742e89aeb" height=70%, width=70%/>
<br>
(Baidu Intelligent Conference System: One Finger Zen)
</div>

3. Simultaneous Translation    
For simultaneous translation, the machine must recognize the speaker's words in real time so that the translation module can translate them into other languages in real time.

<div align=center>
<img href="https://infoflow.baidu.com/audio-video/#/" src="https://ai-studio-static-online.cdn.bcebos.com/7472f6f976e94e3288dacb0a8bffd9a824f31e392e48496d830f5f11626c0851" height=50%, width=50%/>
<br>
  (Ruliu: intelligent conference https://infoflow.baidu.com/audio-video/#/)
</div>

4. Telephone Quality Inspection    
Calls are converted into text by a real-time speech recognition service or a recorded-file recognition service, so that quality inspection covers all of the content and becomes more efficient.    

<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/cbd0af3553ff4b8891bb6239069ad76d95bbc36fb98444378a3b3d716eb1fbcb" height=40%, width=40%/>
</div>

5. Speech Message Transfer    
The user's voice message is converted into text by a one-sentence recognition service, so that the recipient can read it faster.    

## 2.3 Datasets
The model is trained on wenetspeech, a 10,000-hour multi-domain Chinese speech recognition dataset.

## 2.4 Demonstration
The following video shows the ASR server being used from a web page: [streaming_asr_demo_video](https://paddlespeech.readthedocs.io/en/latest/streaming_asr_demo_video.html)

# 3. Model Usage
## 3.1 Model Inference
### Install PaddleSpeech

```
pip install paddlespeech==1.2.0
pip install paddleaudio==1.0.1
```

### Download Test Audio

```
wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```

### Inference

```
from paddlespeech.cli.asr.infer import ASRExecutor
audio = "zh.wav"
asr = ASRExecutor()
result = asr(audio_file=audio, model='conformer_online_wenetspeech')
print(result)
```

## 3.2 Model Training
[Streaming conformer training based on wenetspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/wenetspeech/asr1)

# 4. Principle of Streaming Conformer Model

## 4.1 Conformer Model Structure

<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/0fc40fc45a8f4046beea14eb69cfc1eee52196d9db974442a4c4df8007f8d70d" height=1200, width=800 />
<br>
</div>
  
  
Conformer consists mainly of an Encoder and a Decoder, and its overall structure is very similar to that of Transformer.  
Conformer and Transformer share the same Decoder; the Encoder differs in two main ways:  
1. The Conformer Encoder contains a conv module, which consists of five parts: a pointwise conv layer, a GLU layer, a depthwise conv layer, a ReLU layer, and a second pointwise conv layer.
2. The Conformer Encoder uses two FeedForward layers, placed at the head and tail of each encoder layer. The output of each FeedForward layer is scaled by 0.5, so the encoder layer as a whole resembles a hamburger.
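
The "hamburger" layout of one encoder layer can be sketched as follows. This is an illustration only: the sub-modules are passed in as plain callables, and the order of the attention and conv modules and the final normalization follow the layout in [2] rather than anything stated above. The name `conformer_block` is an assumption.

```python
def conformer_block(x, ffn1, attention, conv, ffn2, norm):
    """Apply one Conformer encoder layer to x.

    Every argument after x is a callable standing in for a trained sub-module.
    """
    x = x + 0.5 * ffn1(x)   # first FeedForward, output scaled by 0.5
    x = x + attention(x)    # multi-head self-attention
    x = x + conv(x)         # conv module
    x = x + 0.5 * ffn2(x)   # second FeedForward, output scaled by 0.5
    return norm(x)          # final layer normalization


# Toy check with a scalar "feature" and identity sub-modules.
identity = lambda v: v
y = conformer_block(1.0, identity, identity, identity, identity, identity)
```

With real sub-modules, `x` would be a tensor of shape (batch, time, dim) and each callable a trained layer.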





## 4.2 Streaming Conformer
Streaming decoding consists of two steps:
1. While the user is speaking: decode with CTC prefix beam search.
2. At the end of speech: decode with CTC prefix beam search + attention_rescoring. attention_rescoring uses the decoder to re-score the CTC results, which changes the ranking of the whole-sentence CTC candidates.
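
Step 1 can be sketched in pure Python as a simplified CTC prefix beam search. This is not the PaddleSpeech implementation: it has no language model, it takes per-frame log-probabilities as nested lists, and the `beams` argument used to carry state from one chunk to the next is an assumption made for illustration.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def log_add(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_prefix_beam_search(log_probs, beam_size=4, blank=0, beams=None):
    """Decode frames of token log-probabilities.

    beams maps prefix -> (log p ending in blank, log p ending in non-blank);
    pass the returned dict back in to continue with the next chunk.
    """
    if beams is None:
        beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (pb, pnb) in beams.items():
            for tok, lp in enumerate(frame):
                if tok == blank:
                    # Blank keeps the prefix unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (log_add(b, pb + lp, pnb + lp), nb)
                elif prefix and tok == prefix[-1]:
                    # Repeat without a blank in between: same prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, log_add(nb, pnb + lp))
                    # Repeat after a blank: the prefix grows.
                    ext = prefix + (tok,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, log_add(nb, pb + lp))
                else:
                    ext = prefix + (tok,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, log_add(nb, pb + lp, pnb + lp))
        ranked = sorted(nxt.items(), key=lambda kv: log_add(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_size])  # best prefix first
    return beams


# Toy posteriors over {0: blank, 1: "a", 2: "b"} for four frames.
frames = [[math.log(p) for p in row] for row in [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
]]
full = ctc_prefix_beam_search(frames)
# Streaming: decode two chunks, carrying the beams across the boundary.
partial = ctc_prefix_beam_search(frames[:2])
streamed = ctc_prefix_beam_search(frames[2:], beams=partial)
best_full = next(iter(full))
best_streamed = next(iter(streamed))
```

Because the search only needs the beams from the previous frame, decoding chunk by chunk gives the same best prefix as decoding the whole utterance at once.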

<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/c37339dbaf5c4c20a67b76d88c6730bb1cd93fc7f71b4179982f42365b969f49" height=1200, width=800 />
<br>(image from "Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html" )
</div>
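
The end-of-speech rescoring step can be sketched as below. The scoring rule (decoder score plus a weighted CTC score), the `ctc_weight` value, and the toy scores are assumptions for illustration; in practice the decoder scores come from running the attention decoder on each CTC candidate.

```python
def attention_rescore(hyps, ctc_weight=0.5):
    """Re-rank CTC n-best hypotheses with decoder scores.

    hyps: list of (tokens, ctc_score, decoder_score); scores are
    log-probabilities. Returns the list sorted best first.
    """
    def combined(hyp):
        _, ctc_score, decoder_score = hyp
        return decoder_score + ctc_weight * ctc_score

    return sorted(hyps, key=combined, reverse=True)


# Hypothetical 2-best list: the CTC-best candidate is listed first.
nbest = [
    (("a", "b"), -1.0, -4.0),
    (("a", "b", "c"), -1.5, -2.0),
]
reranked = attention_rescore(nbest)
```

Here the second candidate moves to first place after rescoring, which is the change in candidate ranking that attention_rescoring is meant to produce.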


Therefore, the core of streaming decoding is a streaming CTC prefix beam search, which in turn requires training an encoder that supports streaming.




### 4.2.1 Point 1: Causal convolution to avoid high delay
In a conventional convolutional network with many stacked convolution layers, each output step depends on several frames after the current step, which increases the delay of a streaming model. The conformer model contains a large number of conv layers, so with conventional convolution the delay of the streaming conformer would be large.  
To solve this problem, the streaming conformer uses causal convolution. Each output step of a causal convolution depends only on the current and earlier time steps, never on later ones, much like an RNN implemented with convolutions. This avoids the high delay.
<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/e77dddf4e0514724b3f24e9f6931aaf1054ebf0b4c1348b59aee6d3a13f833fe" height=800, width=500 />
<br>(image from "Bai S, Kolter J Z, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling" )
</div>
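
A minimal pure-Python sketch of a 1-D causal convolution, to make the "no future frames" property concrete. It pads on the left only; the function name and toy inputs are illustrative assumptions.

```python
def causal_conv1d(x, kernel):
    """1-D causal convolution: output[t] depends only on x[t-K+1..t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # pad on the left only
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(x))
    ]


y = causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
# Changing a future sample leaves the earlier outputs untouched.
y_changed = causal_conv1d([1.0, 2.0, 3.0, 100.0], [1.0, 1.0, 1.0])
```

A conventional "same" convolution would pad on both sides, so each output would also depend on later inputs and could not be emitted until those arrive.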




### 4.2.2 Point 2: Attention with mask 

The main challenge in implementing a streaming encoder is that the conformer's attention normally uses global information, as in the first sub-figure below, which makes streaming impossible. To solve this problem, the streaming conformer limits the scope of attention during training.  
The main strategies for limiting the attention scope are shown in the following figure:

<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/5a8cecc5d0b54898bd9ee4d4573433de992d68a234de418daaa02e6f80289b46" height=1200, width=800 />
<br>(image from "Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html" )
</div>

To use as much speech context as possible, we generally use the third type of attention scope.  
During training, a random chunk size is used for each batch of data, which makes the model more robust and lets it work with a variety of chunk sizes at decoding time.  
During decoding, a fixed chunk size is used.
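
A minimal sketch of a chunk-based attention mask, assuming the chosen attention scope lets each frame see all earlier frames plus the frames in its own chunk. The function names and the sampling range `max_chunk=25` are assumptions for illustration, not values taken from PaddleSpeech.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """mask[i][j] is True when frame i may attend to frame j.

    Frame i sees every frame up to the end of its own chunk.
    """
    mask = []
    for i in range(num_frames):
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def sample_chunk_size(max_chunk=25, rng=random):
    """Training: draw a new chunk size for each batch (range is assumed)."""
    return rng.randint(1, max_chunk)


# Decoding: a fixed chunk size, here 2 frames over a 4-frame input.
mask = chunk_attention_mask(num_frames=4, chunk_size=2)
```

Frames 0 and 1 cannot see frames 2 and 3, so the encoder output for the first chunk can be computed before the second chunk arrives.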

### 4.2.3 Point 3: Cache
During decoding, the conformer uses caches to reduce redundant computation.  
The cache of the conformer encoder consists of three parts:
1. subsampling_cache  
2. conformer_cnn_cache 
3. elayers_output_cache 
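```
# Excerpt from the encoder's chunk-by-chunk forward pass (runs inside a method).
# xs: input features with shape (batch, num_frames, feat_dim).
outputs = []
offset = 0
# Feed forward overlapping input chunks step by step
for cur in range(0, num_frames - context + 1, stride):
    end = min(cur + decoding_window, num_frames)
    chunk_xs = xs[:, cur:end, :]
    # Each call consumes the caches of the previous chunk and returns updated ones
    (y, subsampling_cache, elayers_output_cache,
     conformer_cnn_cache) = self.forward_chunk(
         chunk_xs, offset, required_cache_size, subsampling_cache,
         elayers_output_cache, conformer_cnn_cache)
    outputs.append(y)
    offset += y.shape[1]
# Concatenate the per-chunk outputs along the time axis
ys = paddle.concat(outputs, axis=1)
```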

<div align=center>
<img src="https://ai-studio-static-online.cdn.bcebos.com/a8e0ff53e2b54fbfbc6f8715dfcba8a50d05b13228eb4ef598a0445336dd3a03" height=1200, width=800 />
<br>(image from "Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html" )
</div>
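
The loop above does not show how `context`, `stride`, and `decoding_window` are obtained. The sketch below assumes the WeNet-style definitions for a 4x subsampling front end (`subsampling=4`, `right_context=6`); these values and the function name `chunk_windows` are assumptions, so check them against the actual encoder code.

```python
def chunk_windows(num_frames, decoding_chunk_size, subsampling=4, right_context=6):
    """Return the (start, end) input-frame range of every chunk in the loop."""
    context = right_context + 1                  # input frames for one output frame
    stride = subsampling * decoding_chunk_size   # hop between chunk starts
    decoding_window = (decoding_chunk_size - 1) * subsampling + context
    return [
        (cur, min(cur + decoding_window, num_frames))
        for cur in range(0, num_frames - context + 1, stride)
    ]


# 200 input frames decoded with 16 output frames per chunk.
windows = chunk_windows(num_frames=200, decoding_chunk_size=16)
```

Under these assumptions consecutive windows overlap by a few input frames, which is why the loop comment speaks of overlapping input.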



1. subsampling_cache: [paddle.Tensor]     
Caches the output of the subsampling module, which is the input of the first conformer block. The cached frames are concatenated with the subsampled current chunk to form the input of the conformer encoder. The subsampling module used by the conformer consists mainly of two CNN layers and one linear layer.  
 
2. conformer_cnn_cache: List[paddle.Tensor]   
Stores the input of the conv module in each conformer block. Because the conv module depends on earlier frames, caching its previous input avoids recomputing them.  

3. elayers_output_cache: List[paddle.Tensor]    
Stores the historical output of each conformer block, so that it can be concatenated with the block's current output to form the input of the next conformer block.  
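
The idea behind conformer_cnn_cache can be illustrated with a toy 1-D causal convolution that keeps its last `K - 1` inputs between chunks. This is a pure-Python sketch under that assumption, not the PaddleSpeech code; the function name and inputs are hypothetical.

```python
def causal_conv1d_chunk(chunk, kernel, cache):
    """Convolve one chunk, using `cache` as left context.

    cache holds the last len(kernel) - 1 inputs seen so far (zeros at start).
    Returns (outputs, new_cache).
    """
    k = len(kernel)
    padded = list(cache) + list(chunk)
    out = [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(chunk))
    ]
    new_cache = padded[len(padded) - (k - 1):] if k > 1 else []
    return out, new_cache


kernel = [1.0, 1.0, 1.0]
cache = [0.0] * (len(kernel) - 1)
streamed = []
for chunk in ([1.0, 2.0], [3.0, 4.0]):
    out, cache = causal_conv1d_chunk(chunk, kernel, cache)
    streamed.extend(out)
```

The streamed output matches what a causal convolution over the whole sequence `[1.0, 2.0, 3.0, 4.0]` would produce, while each chunk only recomputes its own frames.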

A non-streaming conformer model can be turned into a streaming conformer model by combining the three points above.

# 5. Note

# 6. Reference       
[1] Chao Yang. http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html    
[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020.    
[3] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376.    