**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
...
</div>
### Features
...
- 🔬 *Integration of mainstream models and datasets*: the toolkit implements modules covering the whole pipeline of the speech tasks, and uses mainstream datasets such as LibriSpeech, LJSpeech, AIShell, and CSMSC. See the [model list](#model-list) for more details.
- 🧩 *Cascaded models application*: as an extension of typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural Language Processing (NLP) and Computer Vision (CV).
### 🔥 Hot Activities
<!---
2021.12.14: We would like to hold an online course to introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 2021.12.21~12.24: 4-Day Live Courses, an in-depth interpretation of PaddleSpeech!

  **Course videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130**
### Recent Update
- 👏🏻 2022.04.28: PaddleSpeech Streaming Server is available for Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verification.
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) demos on Hugging Face Spaces are available!
...
We strongly recommend that our users install PaddleSpeech on **Linux** with *python>=3.7*.
Up to now, **Linux** supports the CLI for all our tasks, while **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
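For the common case, installation reduces to a couple of pip commands (a minimal sketch of the easy route; version pins and CUDA variants are covered in the installation doc):

```bash
# A minimal sketch, assuming a clean Python >= 3.7 environment;
# pick a CUDA build of paddlepaddle from the Paddle site for GPU use.
pip install paddlepaddle
pip install paddlespeech
```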
For more information about server command lines, please see the [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server).
<a name="quickstartstreamingserver"></a>
## Quick Start Streaming Server
Developers can try the [streaming ASR](./demos/streaming_asr_server/README.md) and [streaming TTS](./demos/streaming_tts_server/README.md) servers.
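Each server starts from a single command; for example (a sketch, assuming the streaming ASR demo's bundled default config, whose filename may differ across versions):

```bash
# Run inside demos/streaming_asr_server; the config filename is an
# assumption based on the demo's defaults.
paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml
```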
The Text-to-Speech module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has now been merged into this repository. If you are interested in academic research on this task, please see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is a good guideline for the pipeline components.
## ⭐ Examples
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): Use PaddleSpeech TTS to generate virtual human voice.**
To cite PaddleSpeech for research, please use the following format.
...
## Acknowledgement
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR for [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing a Virtual Uploader (VUP)/Virtual YouTuber (VTuber) with the PaddleSpeech TTS function.
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav -v
# Chinese ASR + Punctuation Restoration
paddlespeech asr --input ./zh.wav -v | paddlespeech text --task punc -v
```
(If you don't want to see the log information, you can remove `-v`. Besides, it doesn't matter if the package `paddlespeech-ctcdecoders` is not found; this package is optional.)
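For example, the same English recognition with logging kept on and the program's prompts (such as sample-rate conversion) accepted automatically, via the `yes` option described in the parameter list below (a sketch; the flag spelling follows that list):

```bash
# --yes auto-accepts prompts such as audio resampling; -v keeps the logs.
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav --yes -v
```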
- `ckpt_path`: Model checkpoint. Use a pretrained model when it is None. Default: `None`.
- `yes`: No additional argument required. Once this flag is set, the program's requests are accepted by default, which includes transforming the audio sample rate. Default: `False`.
- `device`: Choose the device to execute model inference. Default: default device of paddlepaddle in the current environment.
- `verbose`: Show the log information.
Output:
```bash
...
This demo is an implementation of starting the streaming speech service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of code in Python.
The streaming ASR server only supports the `websocket` protocol; it does not support `http`.
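Accessing the running service therefore goes through the websocket client; for example (a sketch; the IP and port are assumptions matching the demo's defaults):

```bash
# Send an audio file to the streaming ASR server over websocket.
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```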
...
### 2. Prepare Config File
The configuration file can be found in `conf/tts_online_application.yaml`.
- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both **http and websocket** are supported.
- `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
  - This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
  - The engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses Python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of `online-onnx` is faster.
- The streaming TTS engine supports **fastspeech2** and **fastspeech2_cnndecoder** as AM models, and **hifigan** and **mb_melgan** as Voc models.
- In streaming AM inference, one chunk of data is inferred at a time to achieve the streaming effect. `am_block` indicates the number of valid frames in a chunk, and `am_pad` indicates the number of frames added before and after `am_block` in a chunk. `am_pad` exists to eliminate errors caused by streaming inference, so that streaming does not affect the quality of the synthesized audio.
  - fastspeech2 does not support streaming AM inference, so `am_pad` and `am_block` have no effect on it.
  - fastspeech2_cnndecoder supports streaming inference. When `am_pad=12`, the audio synthesized by streaming inference is consistent with non-streaming synthesis.
- In streaming Voc inference, one chunk of data is inferred at a time to achieve the streaming effect. `voc_block` indicates the number of valid frames in a chunk, and `voc_pad` indicates the number of frames added before and after `voc_block` in a chunk. `voc_pad` exists to eliminate errors caused by streaming inference, so that streaming does not affect the quality of the synthesized audio.
  - Both hifigan and mb_melgan support streaming Voc inference.
  - When the Voc model is mb_melgan, setting `voc_pad=14` makes the streaming output consistent with the non-streaming output; `voc_pad` can be reduced to 7 with no audible artifacts, but below 7 the synthesized audio sounds abnormal.
  - When the Voc model is hifigan, setting `voc_pad=20` makes the streaming output consistent with the non-streaming output; with `voc_pad=14` the synthesized audio has no audible artifacts.
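With this configuration in place, starting the service is a single command (a sketch; the invocation mirrors the other server demos and assumes you run from the demo directory):

```bash
# Start the streaming TTS server with the config described above.
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```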
...
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], where 0 means the same as the model. Default: 0.
- `output`: Output wave filepath. Default: None, which means the audio is not saved locally.
- `play`: Whether to play the audio while synthesizing. Default: False, which means not playing. **Playing audio requires the pyaudio library**.
Output:
```bash
...
- `protocol`: Service protocol, choices: [http, websocket]. Default: http.
- `input` (required): Input text to generate.
- `spk_id`: Speaker id for multi-speaker text-to-speech. Default: 0.
- `speed`: Audio speed; the value should be set between 0 and 3. Default: 1.0.
- `volume`: Audio volume; the value should be set between 0 and 3. Default: 1.0.
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], where 0 means the same as the model. Default: 0.
- `output`: Output wave filepath. Default: None, which means the audio is not saved locally.
- `play`: Whether to play the audio while synthesizing. Default: False, which means not playing. **Playing audio requires the pyaudio library**.
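Putting these arguments together, a client call might look like this (a sketch; the server address, port and input text are placeholders matching the demo's defaults):

```bash
# Request streaming synthesis over http and save the result locally.
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 \
    --protocol http --input "您好，欢迎使用语音合成服务。" --output output.wav
```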
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
...
# others
lang: 'zh'
device: 'cpu'  # set 'gpu:id' or 'cpu'
# am_block and am_pad are only used by the fastspeech2_cnndecoder model for streaming am inference;
# when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum pad can be set to 7, and the streaming synthetic audio still sounds normal
# when the voc model is hifigan_csmsc and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, the streaming synthetic audio sounds normal
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
...
# others
lang: 'zh'
# am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
# when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum pad can be set to 7, and the streaming synthetic audio still sounds normal
# when the voc model is hifigan_csmsc_onnx and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, the streaming synthetic audio sounds normal
voc_block: 14
voc_pad: 14
# voc_upsample should be the same as n_shift in the voc config.
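To switch the service between the two engine types, only `engine_list` needs to change (a sketch; the exact list syntax in the shipped YAML is an assumption):

```bash
# Point engine_list at the onnx engine, then restart the server.
sed -i "s/engine_list: \['tts_online'\]/engine_list: ['tts_online-onnx']/" conf/tts_online_application.yaml
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```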
The script performs diarization using x-vectors (TDNN, ECAPA-TDNN) on the AMI mix-headset data. We demonstrate the use of different clustering methods: AHC and spectral clustering.
## How to Run
### Prepare annotations and audios
Download the AMI corpus. You need around 10 GB of free space to get the whole dataset.
The signals are too large to package in one piece, so you need to use the chooser to indicate which ones you wish to download.
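Once the corpus and the manual annotations are on disk, the recipe can be launched roughly as follows (a sketch; the flag names assume the example's `run.sh` forwards `data_folder` and `manual_annot_folder`, and both paths are placeholders):

```bash
# Placeholders: point these at your downloaded AMI audio and the
# ami_public_manual annotation release.
bash run.sh --data_folder ./amicorpus --manual_annot_folder ./ami_public_manual_1.6.2
```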
* More `Paddle` Docker images can be found [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html).
* If you only want to work on CPU, please download the corresponding [image](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html) and use `docker` instead of `nvidia-docker`.
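For example, a CPU-only session could look like this (a sketch; the image tag is illustrative, pick the actual tag from the linked page):

```bash
# Pull a CPU-only Paddle image and open a shell in it, mounting the
# current directory into the container.
docker pull paddlepaddle/paddle:2.3.0
docker run --name paddle_cpu -v $PWD:/paddle -it paddlepaddle/paddle:2.3.0 /bin/bash
```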