Unverified commit 5e2144c8, authored by Zth9730, committed via GitHub

add PP-ASR (#5559)

* add PP-ASR

Parent commit: c676b870
import gradio as gr
import librosa
import soundfile as sf
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.text.infer import TextExecutor


def model_inference(audio):
    asr = ASRExecutor()
    text_punc = TextExecutor()
    if not isinstance(audio, str):
        audio = str(audio.name)
    y, sr = librosa.load(audio)
    if sr != 16000:
        # Resample to the 16 kHz sample rate expected by the ASR model.
        y = librosa.resample(y, sr, 16000)
        sf.write(audio, y, 16000)
    result = asr(
        audio_file=audio,
        model='conformer_online_wenetspeech',
        device="cpu")
    # Restore punctuation in the recognized text.
    result = text_punc(
        text=result, model='ernie_linear_p7_wudao', device="cpu")
    return result


def clear_all():
    return None, None, None, None


with gr.Blocks() as demo:
    gr.Markdown("ASR")

    with gr.Column(scale=1, min_width=100):
        audio_input = gr.Audio(
            value='https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav',
            type="file",
            label="Input From File")
        micro_input = gr.inputs.Audio(
            source="microphone", type='filepath', label="Input From Mic")

        with gr.Row():
            btn1 = gr.Button("Clear")
            btn2 = gr.Button("Submit File")
            btn3 = gr.Button("Submit Micro")

        audio_text_output = gr.Textbox(placeholder="Result...", lines=10)
        micro_text_output = gr.Textbox(placeholder="Micro Result...", lines=10)

    btn3.click(
        fn=model_inference,
        inputs=[micro_input],
        outputs=micro_text_output,
        scroll_to_output=True)
    btn2.click(
        fn=model_inference,
        inputs=[audio_input],
        outputs=audio_text_output,
        scroll_to_output=True)
    btn1.click(
        fn=clear_all,
        inputs=None,
        outputs=[
            audio_input, micro_input, audio_text_output, micro_text_output
        ])

demo.launch(share=True)
【PP-ASR-App-YAML】
APP_Info:
title: PP-ASR-App
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 3.4.1
app_file: app.py
license: apache-2.0
device: cpu
paddlepaddle==2.3.2
paddlespeech==1.2.0
paddleaudio==1.0.1
soundfile==0.11.0
librosa==0.8.1
## 1. Inference Benchmark
### 1.1 Hardware and software environment
* PP-ASR model inference speed is measured on a single V100 GPU with batch size = 1, CUDA 10.2, and cuDNN 7.5.1.
### 1.2 Datasets
The PP-ASR model uses the train split of wenetspeech as the training set and the test split of aishell1 as the test set.
### 1.3 Metrics
Streaming decoding with chunk size 16:

| Model | Decoding Method | Chunk Size | CER | RTF |
| --- | --- | --- | --- | --- |
| conformer | attention | 16 | 0.056273 | 0.0003696 |
| conformer | ctc_greedy_search | 16 | 0.078918 | 0.0001571 |
| conformer | ctc_prefix_beam_search | 16 | 0.079080 | 0.0002221 |
| conformer | attention_rescoring | 16 | 0.054401 | 0.0002569 |

Full-utterance (non-streaming) decoding, chunk size -1:

| Model | Decoding Method | Chunk Size | CER | RTF |
| --- | --- | --- | --- | --- |
| conformer | attention | -1 | 0.050767 | 0.0003589 |
| conformer | ctc_greedy_search | -1 | 0.061884 | 0.0000435 |
| conformer | ctc_prefix_beam_search | -1 | 0.062056 | 0.0001934 |
| conformer | attention_rescoring | -1 | 0.052110 | 0.0002103 |
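
CER in the tables above is the character error rate: the character-level edit distance between the recognition result and the reference transcript, divided by the number of reference characters. A minimal sketch of the computation (for illustration only; the numbers above come from the PaddleSpeech evaluation pipeline, not from this snippet):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance between the two character sequences,
    normalized by the reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution
        prev = cur
    return prev[-1] / len(ref)


print(cer("今天天气真好", "今天天气很好"))  # one substitution over six characters ≈ 0.167
```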
## 2. Usage Instructions
Please refer to: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/wenetspeech/asr1
## 1. Inference Benchmark
### 1.1 Hardware and software environment
* PP-ASR model inference speed is measured on a single V100 GPU with batch size = 1, CUDA 10.2, and cuDNN 7.5.1.
### 1.2 Datasets
The PP-ASR model uses the train split of wenetspeech as the training set and the test split of aishell1 as the test set.
### 1.3 Performance
Streaming decoding with chunk size 16:

| Model | Decoding Method | Chunk Size | CER | RTF |
| --- | --- | --- | --- | --- |
| conformer | attention | 16 | 0.056273 | 0.0003696 |
| conformer | ctc_greedy_search | 16 | 0.078918 | 0.0001571 |
| conformer | ctc_prefix_beam_search | 16 | 0.079080 | 0.0002221 |
| conformer | attention_rescoring | 16 | 0.054401 | 0.0002569 |

Full-utterance (non-streaming) decoding, chunk size -1:

| Model | Decoding Method | Chunk Size | CER | RTF |
| --- | --- | --- | --- | --- |
| conformer | attention | -1 | 0.050767 | 0.0003589 |
| conformer | ctc_greedy_search | -1 | 0.061884 | 0.0000435 |
| conformer | ctc_prefix_beam_search | -1 | 0.062056 | 0.0001934 |
| conformer | attention_rescoring | -1 | 0.052110 | 0.0002103 |
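
RTF (real-time factor) is the ratio of processing time to audio duration, so an RTF below 1 means recognition runs faster than real time. A minimal sketch of how it could be measured (illustrative only; `recognize` is a placeholder for any ASR callable, not a PaddleSpeech API):

```python
import time

import soundfile as sf


def real_time_factor(recognize, wav_path):
    """Return processing_time / audio_duration for one utterance."""
    samples, sample_rate = sf.read(wav_path)
    audio_seconds = len(samples) / sample_rate
    start = time.time()
    recognize(wav_path)  # placeholder: any function mapping a wav path to text
    return (time.time() - start) / audio_seconds
```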
## 2. Relevant instructions
Please refer to: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/wenetspeech/asr1
## Streaming Conformer Speech Recognition Model
Pretrained model download: https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz

Streaming decoding with chunk size 16:

| Model | Params | Augmentation | Test Set | Decoding Method | Chunk Size | CER |
| --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | spec_aug | aishell1 | attention | 16 | 0.056273 |
| conformer | 32.52 M | spec_aug | aishell1 | ctc_greedy_search | 16 | 0.078918 |
| conformer | 32.52 M | spec_aug | aishell1 | ctc_prefix_beam_search | 16 | 0.079080 |
| conformer | 32.52 M | spec_aug | aishell1 | attention_rescoring | 16 | 0.054401 |

Full-utterance (non-streaming) decoding, chunk size -1:

| Model | Params | Augmentation | Test Set | Decoding Method | Chunk Size | CER |
| --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | spec_aug | aishell1 | attention | -1 | 0.050767 |
| conformer | 32.52 M | spec_aug | aishell1 | ctc_greedy_search | -1 | 0.061884 |
| conformer | 32.52 M | spec_aug | aishell1 | ctc_prefix_beam_search | -1 | 0.062056 |
| conformer | 32.52 M | spec_aug | aishell1 | attention_rescoring | -1 | 0.052110 |
## Streaming Conformer Pretrained Model
Pretrained model download: https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz

Streaming decoding with chunk size 16:

| Model | Params | Config | Augmentation | Test Set | Decode Method | Chunk Size | CER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | attention | 16 | 0.056273 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | ctc_greedy_search | 16 | 0.078918 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | ctc_prefix_beam_search | 16 | 0.079080 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | attention_rescoring | 16 | 0.054401 |

Full-utterance (non-streaming) decoding, chunk size -1:

| Model | Params | Config | Augmentation | Test Set | Decode Method | Chunk Size | CER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | attention | -1 | 0.050767 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | ctc_greedy_search | -1 | 0.061884 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | ctc_prefix_beam_search | -1 | 0.062056 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | attention_rescoring | -1 | 0.052110 |
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 1. 模型介绍\n",
"## 1.1 简介\n",
"PP-ASR 是一个 提供 ASR 功能的工具。其提供了多种中文和英文的模型,支持模型的训练,并且支持使用命令行的方式进行模型的推理。 PP-ASR 也支持流式模型的部署,以及个性化场景的部署。 PP-ASR支持多种预训练模型:[released_model](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md)。 其中效果较好的模型为支持流式 ASR 的 Conformer 模型。\n",
"\n",
"## 1.2 特点\n",
"语音识别的基本流程如下图所示:\n",
"<div align=center>\n",
"<img src=\"https://user-images.githubusercontent.com/87408988/168259962-cbe2008b-47b6-443d-9566-d77a5ca2eb25.png\"/>\n",
"<br>\n",
"</div>\n",
"<br></br>\n",
"PP-ASR 的主要特点如下:\n",
"\n",
"1. 提供在中/英文开源数据集 aishell (中文),wenetspeech(中文),librispeech (英文)上的预训练模型。模型包含 deepspeech2 模型以及 conformer/transformer 模型。\n",
"2. 支持中/英文的模型训练功能。\n",
"3. 支持命令行方式的模型推理,可使用 paddlespeech asr --model xxx --input xxx.wav 方式调用各个预训练模型进行推理。\n",
"4. 支持流式 ASR 的服务部署,也支持输出时间戳。\n",
"5. 支持个性化场景的部署。\n",
"\n",
"更多内容欢迎来 [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech) 进行体验!\n",
"\n",
"\n",
"# 2. 模型效果及应用场景\n",
"## 2.1 流式语音识别任务\n",
"\n",
"语音识别(Automatic Speech Recognition, ASR) 是一项从一段音频中提取出语言文字内容的任务。而流式语音识别则是用户将一整段语音分段,以流式输入,最后得到识别结果。\n",
"\n",
"实时语音识别引擎在获得分段的输入语音的同时,就可以同步地对这段数据进行特征提取和解码工作,而不用等到所有数据都获得后再开始工作。因此这样就可以在最后一段语音结束后,仅延迟很短的时间(也即等待处理最后一段语音数据以及获取最终结果的时间)即可返回最终识别结果。这种流式输入方式能缩短整体上获得最终结果的时间,极大地提升用户体验。 \n",
"\n",
"## 2.2 应用场景\n",
"1. 人机交互/语音输入法 \n",
"流式语音识别可以在用户说话的时候实时生成文字,加快了机器对人的反馈速度,使得用户的使用体验得到提升。\n",
"\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6a68196417234818b3241616a1649741eef4f919c67141d9b9ad371780d110a8\" height=50%, width=50%/>\n",
"<br>\n",
" (百度智能音箱:https://dumall.baidu.com/)\n",
"</div>\n",
"\n",
" \n",
"2. 实时字幕/会议纪要 \n",
"在会议场景,边说话,边转写文本。\n",
"将会议、庭审、采访等场景的音频信息转换为文字,由实时语音识别服务实现,降低人工记录成本、提升效率。\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/546271f5bad341acb208d3d497874028da5a664e9e1e460eb61af6a742e89aeb\" height=70%, width=70%/>\n",
"<br>\n",
"(百度智能会议系统:一指禅)\n",
"</div>\n",
"\n",
"3. 同声翻译 \n",
"在机器进行同声翻译的时候,机器需要能实时识别出用户的说话内容,才能将说话的内容通过翻译模块实时翻译成别的语言。 \n",
"\n",
"<div align=center>\n",
"<img href=\"https://infoflow.baidu.com/audio-video/#/\" src=\"https://ai-studio-static-online.cdn.bcebos.com/7472f6f976e94e3288dacb0a8bffd9a824f31e392e48496d830f5f11626c0851\" height=50%, width=50%/>\n",
"<br>\n",
" (如流:智能会议 https://infoflow.baidu.com/audio-video/#/)\n",
"</div>\n",
"\n",
"4. 电话质检 \n",
"将坐席通话转成文字,由实时语音识别服务或录音文件识别服务实现,全面覆盖质检内容、提升质检效率。 \n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/cbd0af3553ff4b8891bb6239069ad76d95bbc36fb98444378a3b3d716eb1fbcb\" height=40%, width=40%/>\n",
"</div>\n",
"\n",
"5. 语音消息转写 \n",
"将用户的语音信息转成文字信息,由一句话识别服务实现,提升用户阅读效率。 \n",
"\n",
"## 2.3 数据集\n",
"模型使用10000小时多领域中文语音识别数据集WenetSpeech。\n",
"\n",
"## 2.4 效果展示\n",
"网页上使用 asr server 的效果展示:[streaming_asr_demo_video](https://paddlespeech.readthedocs.io/en/latest/streaming_asr_demo_video.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. 模型使用\n",
"## 3.1 模型推理\n",
"### 安装paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install paddlespeech"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 下载测试音频"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 推理"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddlespeech.cli.asr.infer import ASRExecutor\n",
"audio = \"zh.wav\"\n",
"asr = ASRExecutor()\n",
"result = asr(audio_file=audio)\n",
"print(result)"
]
},
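{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The pretrained models can also be called from the command line, following the\n",
"# `paddlespeech asr --model xxx --input xxx.wav` form described above. The model\n",
"# name below is the streaming Conformer used elsewhere in PP-ASR; treat this cell\n",
"# as an illustrative sketch rather than a complete reference of the CLI options.\n",
"!paddlespeech asr --model conformer_online_wenetspeech --input zh.wav"
]
},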
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 模型训练\n",
"[基于wenetspeech的流式 Conformer 训练](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/wenetspeech/asr1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 4. 流式 Conformer 模型原理"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## 4.1 Confomer 模型结构\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0fc40fc45a8f4046beea14eb69cfc1eee52196d9db974442a4c4df8007f8d70d\" height=1200, width=800 />\n",
"<br>\n",
"</div>\n",
" \n",
" \n",
"Conformer 主要由 Encoder 和 Decoder 两个部分组成,整体的模型结构和 Transformer 非常相似。 \n",
"Conformer 和 Transformer 有着相同的 Decoder,主要的区别有2点: \n",
"1. Conformer 的 Encoder 中包含了 conv 模块。该 conv 模块由 pointwise conv,GLU层,Depthwith conv, RELU层,以及第二层 pointwise conv, 共5个部分组成。 \n",
"2. Conformer 的 Encoder 使用了2层 FeedForward,分别位于每层 encoder的头和尾,并且设置每层输出的权重设置为0.5,整体类似于一个汉堡的结构。\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"\n",
"\n",
"## 4.2 流式 Conformer\n",
"流式解码主要分为2个步骤:\n",
"1. 说话中:使用 CTC prefix beam search 进行解码。\n",
"2. 说话结束:使用 CTC prefix beam search + attention_rescoring 进行解码。 其中 attention_rescoring 主要是用 decoder 对 ctc 的结果进行重打分,从而改变了 ctc 整句结果的候选排序。\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c37339dbaf5c4c20a67b76d88c6730bb1cd93fc7f71b4179982f42365b969f49\" height=1200, width=800 />\n",
"<br>(图片来自\"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
"</div>\n",
"\n",
"\n",
"因此,流式解码的核心在于支持流式的 CTC prefix beam search,而流式的 CTC prefix beam search 在于训练一个可以支持流式的 Encoder。\n",
"\n",
"\n",
"\n",
"\n",
"### 4.2.1 要点1:因果卷积,避免高时延\n",
"如果使用通常的卷积网络,如果使用了很多层卷积,网络输出的每一步将会大量依赖当前步后的多帧,从而增大了流式模型的时延,而 conformer 模型中存在大量的 conv 层,因此,如果使用普通的卷积, 流式 conformer 模型的时延会很大。 \n",
"为了解决这个问题,流式 conformer 使用了 因果卷积。因果卷积的每一步的输出只会依赖之前的时间点,而不会依赖之后的时间点,类似于卷积实现的 RNN 结构。从而避免了 conformer 模型的高时延。\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/e77dddf4e0514724b3f24e9f6931aaf1054ebf0b4c1348b59aee6d3a13f833fe\" height=800, width=500 />\n",
"<br>(图片来自\"Bai S, Kolter J Z, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling\" )\n",
"</div>\n",
"\n",
"\n",
"\n",
"\n",
"### 4.2.2 要点2:带有 mask 的 attention\n",
"\n",
"实现流式的 Encoder 的主要挑战是 conformer 的 attention 结构通常是使用全局的信息,如下图中第一张子图所示,从而无法实现流式。为了解决这个问题,流式 conformer 在训练的过程中会限制 attention 的作用范围。 \n",
"关于 attention 的作用范围,主要的策略如下图所示:\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/5a8cecc5d0b54898bd9ee4d4573433de992d68a234de418daaa02e6f80289b46\" height=1200, width=800 />\n",
"<br>(图片来自\"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
"</div>\n",
"\n",
"为了尽可能多地使用语音地上下文信息,我们一般使用第三种 attention 作用范围。 \n",
"在训练的过程中,为了增强模型的健壮性,同时也让模型在解码过程中可以适用于多种 chunk size, 对于每个 batch 的数据,会采用随机的 chunk size 大小进行训练。 \n",
"而在解码的过程中,我们使用固定的 chunk size 进行解码。\n",
"\n",
"### 4.2.3 要点 3: cache\n",
"conformer 在进行解码的过程中,会使用 cache 来减小冗余的计算量。 \n",
"conformer Encoder 的 cache 主要分为 3 个 部分: \n",
"1. subsampling_cache \n",
"2. conformer_cnn_cache \n",
"3. elayers_output_cache \n",
"```\n",
"\t\t# Feed forward overlap input step by step\n",
" for cur in range(0, num_frames - context + 1, stride):\n",
" end = min(cur + decoding_window, num_frames)\n",
" chunk_xs = xs[:, cur:end, :]\n",
" (y, subsampling_cache, elayers_output_cache,\n",
" conformer_cnn_cache) = self.forward_chunk(\n",
" chunk_xs, offset, required_cache_size, subsampling_cache,\n",
" elayers_output_cache, conformer_cnn_cache)\n",
" outputs.append(y)\n",
" offset += y.shape[1]\n",
" ys = paddle.cat(outputs, 1)\n",
"```\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a8e0ff53e2b54fbfbc6f8715dfcba8a50d05b13228eb4ef598a0445336dd3a03\" height=1200, width=800 />\n",
"<br>(图片来自\"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
"</div>\n",
"\n",
"\n",
"\n",
"1. subsampling cache: [paddle.Tensor] \n",
" subsampling的输出的 cache,即为第一个conformer block 的输入。 用于缓存输入的特征经过 subsampling 模块之后的结果, 而当前的输入 chunk 和 subsampling cache 合并作为 conformer encoder 的输入。conformer 使用的 subsampling 主要由于 2 层 cnn 和一层 linear 构成。 \n",
" \n",
"2. conformer_cnn_cache: List[paddle.Tensor] \n",
"主要存储每个 conformer block 当中 conv 模块的输入, 由于 conv 模块会依赖之前的帧信息,所以需要对之前的输入进行缓存,节约计算时间。 \n",
"\n",
"3. layers_output_cache: List[paddle.Tensor] \n",
"主要存储当前 conformer block 的历史输出, 从而可与当前 conformer block 的输出拼接后作为作为下一个 conformer block 的输入。 \n",
"\n",
"一个非流式的 conformer 模型通过结合以上的 3 个要点,就可以转变为流式的 conformer 模型。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. 注意事项"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 6. 引用 \n",
"[1] Chao Yang. http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html \n",
"[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020. \n",
"[3] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 1. Introduction\n",
"## 1.1 Overview\n",
"PP-ASR provides a tool for ASR functions. It provides a variety models in Chinese and English, supports model training, and supports model inference using the command line. PP-ASR also supports the deployment of streaming models and the deployment of personalized scenarios. PP-ASR supports multiple pre-training models: [released_model](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md). The better model is Conformer model which supports streaming ASR.\n",
"\n",
"## 1.2 Features\n",
"The basic process of speech recognition is shown in the following figure: \n",
"<div align=center>\n",
"<img src=\"https://user-images.githubusercontent.com/87408988/168259962-cbe2008b-47b6-443d-9566-d77a5ca2eb25.png\"/>\n",
"<br>\n",
"</div>\n",
"<br></br>\n",
"The main features of PP-ASR are as follows:\n",
"\n",
"1. Provide pre-trained models on Chinese/English open source datasets: aishell1(Chinese), wenetspeech (Chinese), librispeech (English). The model includes the deepspeech2 model and the conformer/transformer model.\n",
"2. Support Chinese/English model training function.\n",
"3. Support command line model inference, you can use paddlespeech asr --model xxx --input xxx.wav to call each pre-trained model for inference.\n",
"4. Support the service deployment of streaming ASR, and also support the output of timestamps.\n",
"5. Support the deployment of personalized scenarios.\n",
"\n",
"Welcome to [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech) for more experience!\n",
"\n",
"\n",
"# 2. Model Representation and Application Scenarios\n",
"## 2.1 Streaming Speech Recognition Task\n",
"\n",
"Automatic Speech Recognition (ASR) is a task to extract language and text content from a piece of speech. The streaming speech recognition is that the user segments a whole speech and inputs it in streaming mode, and finally gets the recognition result.\n",
"\n",
"The real-time speech recognition engine can simultaneously extract and decode the features of the segmented input speech without waiting for all the data to be obtained. Therefore, after the last speech, the final recognition result can be returned only after a short delay (that is, the time to wait for processing the last speech segment and obtaining the final result). This streaming input mode can shorten the overall time to obtain the final results, and greatly improve the user experience.\n",
"\n",
"## 2.2 Application Scenario\n",
"1. Human–Computer Interaction/Speech Input \n",
"Streaming speech recognition can generate text in real time when users speak, speeding up the feedback speed of machines to people, and improving the user experience.\n",
"\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/6a68196417234818b3241616a1649741eef4f919c67141d9b9ad371780d110a8\" height=50%, width=50%/>\n",
"<br>\n",
" (Baidu smart audio: https://dumall.baidu.com/)\n",
"</div>\n",
"\n",
" \n",
"2. Real Time Subtitles/Meeting Minutes \n",
"In the meeting scene, speak while transcribing the text.\n",
"Convert audio information of meetings, court trials, interviews and other scenes into text, which is realized by real-time speech recognition services, reducing manual recording costs and improving efficiency.\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/546271f5bad341acb208d3d497874028da5a664e9e1e460eb61af6a742e89aeb\" height=70%, width=70%/>\n",
"<br>\n",
"(Baidu Intelligent Conference System: One Finger Zen)\n",
"</div>\n",
"\n",
"3. Simultaneous Translation\n",
"When the machine performs simultaneous translation, the machine needs to be able to recognize the user's speech content in real time, so as to translate the speech content into other languages in real time through the translation module. \n",
"\n",
"<div align=center>\n",
"<img href=\"https://infoflow.baidu.com/audio-video/#/\" src=\"https://ai-studio-static-online.cdn.bcebos.com/7472f6f976e94e3288dacb0a8bffd9a824f31e392e48496d830f5f11626c0851\" height=50%, width=50%/>\n",
"<br>\n",
" (Ruliu: intelligent conference https://infoflow.baidu.com/audio-video/#/)\n",
"</div>\n",
"\n",
"4. Telephone Quality Inspection \n",
"Turn the call into text, which is realized by real-time speech recognition service or recording file recognition service, to comprehensively cover the quality inspection content and improve the quality inspection efficiency. \n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/cbd0af3553ff4b8891bb6239069ad76d95bbc36fb98444378a3b3d716eb1fbcb\" height=40%, width=40%/>\n",
"</div>\n",
"\n",
"5. Speech Message Transfer \n",
"Turn the user's speech information into text information, which is realized by one sentence recognition service, and improve the user's reading efficiency. \n",
"\n",
"## 2.3 Datasets\n",
"The model uses the 10000 hour multi-domain Chinese speech recognition dataset wenetspeech。\n",
"\n",
"## 2.4 Demonstration\n",
"The effect of using asr server on the webpage is shown as follows:[streaming_asr_demo_video](https://paddlespeech.readthedocs.io/en/latest/streaming_asr_demo_video.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Model Usage\n",
"## 3.1 Model Inference\n",
"### install paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install paddlespeech"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### download test audio"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### inference"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from paddlespeech.cli.asr.infer import ASRExecutor\n",
"audio = \"zh.wav\"\n",
"asr = ASRExecutor()\n",
"result = asr(audio_file=audio)\n",
"print(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2 model training\n",
"[Streaming conformer training based on wenetspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/wenetspeech/asr1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 4. Principle of Streaming Conformer Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## 4.1 Confomer Model Structure\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/0fc40fc45a8f4046beea14eb69cfc1eee52196d9db974442a4c4df8007f8d70d\" height=1200, width=800 />\n",
"<br>\n",
"</div>\n",
" \n",
" \n",
"Conformer is mainly composed of Encoder and Decoder. The overall model structure is very similar to Transformer. \n",
"Conformer and Transformer have the same Decoder, with two main differences: \n",
"1. The Encoder of the Conformer contains the conv module. The conv module consists of five parts: pointwise conv, GLU layer, Depthwith conv, RELU layer, and the second pointwise conv layer. \n",
"2. The Encoder of Conform uses two layers of FeedForward, which are located at the head and tail of each layer of encoder respectively. The weight of each layer output is set to 0.5, which is similar to the structure of a hamburger as a whole.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"\n",
"\n",
"## 4.2 Streaming Conformer\n",
"Streaming decoding is mainly divided into two steps:\n",
"1. While speaking: use CTC prefix beam search to decode\n",
"2. End of speech: use CTC prefix beam search + attention_rescoring to decode. attention_rescoring mainly uses decoder to re-score ctc results, thus changing the candidate ranking of whole sentence ctc results.\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/c37339dbaf5c4c20a67b76d88c6730bb1cd93fc7f71b4179982f42365b969f49\" height=1200, width=800 />\n",
"<br>(image from \"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
"</div>\n",
"\n",
"\n",
"Therefore, the core of streaming decoding lies in supporting streaming CTC prefix beam search, and streaming CTC prefix beam search lies in training an encoder that can support streaming.\n",
"\n",
"\n",
"\n",
"\n",
"### 4.2.1 Point 1: Causal convolution to avoid high delay\n",
"For traditional convolution networks, if many layers of convolution are used, each step of the network output will rely heavily on the multiple frames after the current step, thus increasing the delay of the streaming model. However, there are a large number of conv layers in the conformer model. Therefore, if traditional convolution is used, the delay of the streaming conformer model will be large. \n",
"In order to solve this problem, stream conformer uses causal convolution. The output of each step of causal convolution will only depend on the previous time point, not the subsequent time point, similar to the RNN structure of convolution implementation. Thus, the high delay of the conformer model is avoided.\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/e77dddf4e0514724b3f24e9f6931aaf1054ebf0b4c1348b59aee6d3a13f833fe\" height=800, width=500 />\n",
"<br>(image from \"Bai S, Kolter J Z, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling\" )\n",
"</div>\n",
"\n",
"\n",
"\n",
"\n",
"### 4.2.2 Point 2: Attention with mask \n",
"\n",
"The main challenge to implement streaming encoder is that the attention structure of the conformer usually uses global information, as shown in the first sub figure in the following figure, so streaming cannot be implemented. In order to solve this problem, streaming conformer will limit the scope of attention during training. \n",
"The main strategies for the scope of attention are shown in the following figure:\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/5a8cecc5d0b54898bd9ee4d4573433de992d68a234de418daaa02e6f80289b46\" height=1200, width=800 />\n",
"<br>(image from \"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
"</div>\n",
"\n",
"In order to use speech context information as much as possible, we generally use the third type of attention scope. \n",
"In the training process, in order to enhance the robustness of the model, and also make the model applicable to a variety of chunk sizes in the decoding process, for each batch of data, the random chunk size will be used for training. \n",
"In the decoding process, we use a fixed chunk size for decoding. \n",
"\n",
"### 4.2.3 Point 3: Cache\n",
"In the process of decoding, the conformer will use cache to reduce redundant computation. \n",
"The cache of the conformer encoder is mainly divided into three parts: \n",
"1. subsampling_cache \n",
"2. conformer_cnn_cache \n",
"3. elayers_output_cache \n",
"```\n",
"\t\t# Feed forward overlap input step by step\n",
" for cur in range(0, num_frames - context + 1, stride):\n",
" end = min(cur + decoding_window, num_frames)\n",
" chunk_xs = xs[:, cur:end, :]\n",
" (y, subsampling_cache, elayers_output_cache,\n",
" conformer_cnn_cache) = self.forward_chunk(\n",
" chunk_xs, offset, required_cache_size, subsampling_cache,\n",
" elayers_output_cache, conformer_cnn_cache)\n",
" outputs.append(y)\n",
" offset += y.shape[1]\n",
" ys = paddle.cat(outputs, 1)\n",
"```\n",
"\n",
"<div align=center>\n",
"<img src=\"https://ai-studio-static-online.cdn.bcebos.com/a8e0ff53e2b54fbfbc6f8715dfcba8a50d05b13228eb4ef598a0445336dd3a03\" height=1200, width=800 />\n",
"<br>(image from \"Chao Yang http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html\" )\n",
"</div>\n",
"\n",
"\n",
"\n",
"1. subsampling cache: [paddle.Tensor] \n",
"The output cache of subsampling is the input of the first conformer block. It is used to cache the results of input features after passing through the subsampling module, while the current input chunk and subsampling cache are combined as the input of the conformer encoder. The subsampling module used by the conformer is mainly composed of two layers of cnn and one layer of linear. \n",
" \n",
"2. conformer_cnn_cache: List[paddle.Tensor] \n",
"It mainly stores the input of the conv module in each conformer block. Because the conv module depends on the previous frame information, it needs to cache the previous input to save computing time. \n",
"\n",
"3. layers_output_cache: List[paddle.Tensor] \n",
"It mainly stores the historical output of the current conformer block, so that it can be spliced with the output of the current conformer block as the input of the next conformer block. \n",
"\n",
"A non streaming conformer model can be transformed into a streaming conformer model by combining the above three points."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Note"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 6. Reference \n",
"[1] Chao Yang. http://placebokkk.github.io/wenet/2021/06/04/asr-wenet-nn-1.html \n",
"[2] Gulati A, Qin J, Chiu C C, et al. Conformer: Convolution-augmented transformer for speech recognition[J]. arXiv preprint arXiv:2005.08100, 2020. \n",
"[3] Graves A, Fernández S, Gomez F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd international conference on Machine learning. 2006: 369-376. "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}