Commit c76c4800 authored by Yang Zhou

Merge branch 'develop' of github.com:SmileGoat/PaddleSpeech into refactor_file_struct

([简体中文](./README_cn.md)|English)
<p align="center">
<img src="./docs/images/PaddleSpeech_logo.png" />
</p>
......@@ -20,20 +17,17 @@
<a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
</p>
<div align="center">
<h4>
<a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#quick-start-streaming-server"> Quick Start Streaming Server</a>
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
</h4>
</div>
------------------------------------------------------------------------------------
**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
......@@ -170,23 +164,12 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🤗 2021.12.14: [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: `CLI` is available for `Audio Classification`, `Automatic Speech Recognition`, `Speech Translation (English to Chinese)` and `Text-to-Speech`.
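
A minimal sketch of these CLI calls (assuming PaddleSpeech is installed and `zh.wav` / `en.wav` are local test clips; the exact options are covered in the Quick Start and demos):
```bash
# Audio classification
paddlespeech cls --input zh.wav
# Automatic speech recognition (Mandarin model)
paddlespeech asr --lang zh --input zh.wav
# Speech translation (English to Chinese)
paddlespeech st --input en.wav
# Text-to-speech
paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --output output.wav
```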
### 🔥 Hot Activities
<!---
2021.12.14: We would like to have an online courses to introduce basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 2021.12.21~12.24
  4-day live course: an in-depth interpretation of PaddleSpeech!

  **Course videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130**
### Community
- Scan the QR code below with WeChat, and you can join the official technical exchange group, get the bonus (more than 20 GB of learning materials, such as papers, code and videos) and the live lesson links. We look forward to your participation.
<div align="center">
<img src="https://raw.githubusercontent.com/yt605155624/lanceTest/main/images/wechat_4.jpg" width = "300" />
<img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" />
</div>
## Installation
......
......@@ -18,40 +18,19 @@
<a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
</p>
<div align="center">
<h4>
<a href="#快速开始"> Quick Start </a>
| <a href="#快速使用服务"> Quick Start Server </a>
| <a href="#快速使用流式服务"> Quick Start Streaming Server </a>
| <a href="#教程文档"> Documents </a>
| <a href="#模型列表"> Model List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
</h4>
</div>
------------------------------------------------------------------------------------
<!---
from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readmes-readable.md
1.What is this repo or project? (You can reuse the repo description you used earlier because this section doesn’t have to be long.)
2.How does it work?
3.Who will use this repo or project?
4.What is the goal of this project?
-->
**PaddleSpeech** is an open-source speech model library based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), built for developing a variety of critical tasks in speech and audio. It contains a large number of cutting-edge and influential deep learning models; some typical application examples are shown below:

##### Speech Recognition
......@@ -178,39 +157,30 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
### Recent Updates
<!---
2021.12.14: We would like to have an online courses to introduce basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 👑 2022.05.13: PaddleSpeech released [PP-ASR](./docs/source/asr/PPASR_cn.md) (streaming speech recognition), [PP-TTS](./docs/source/tts/PPTTS_cn.md) (streaming text-to-speech) and [PP-VPR](docs/source/vpr/PPVPR_cn.md) (full-pipeline speaker verification).
- 👏🏻 2022.05.06: PaddleSpeech Streaming Server is available! It covers speech recognition (with punctuation restoration and timestamps) and text-to-speech.
- 👏🏻 2022.05.06: PaddleSpeech Server is available! It covers audio classification, speech recognition, text-to-speech, speaker verification and punctuation restoration.
- 👏🏻 2022.03.28: PaddleSpeech CLI covers audio classification, speech recognition, speech translation (English to Chinese), text-to-speech and speaker verification.
- 🤗 2021.12.14: PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
### 🔥 Join the Technical Exchange Group for Benefits

**Course replays and materials: https://aistudio.baidu.com/aistudio/education/group/info/25130**

- Link to the 3-day live course: an in-depth interpretation of the key technologies behind the PP-TTS, PP-ASR and PP-VPR speech systems
- 20 GB learning package: video courses, cutting-edge papers and study materials

Scan the QR code below with WeChat to follow the official account, then click "Sign up now" and fill in the questionnaire to join the official technical exchange group, where you can get your questions answered more efficiently and exchange ideas with developers from all walks of life. We look forward to your participation.
<div align="center">
<img src="https://raw.githubusercontent.com/yt605155624/lanceTest/main/images/wechat_4.jpg" width = "300" />
<img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" />
</div>
## Installation

We strongly recommend that users install PaddleSpeech on **Linux** with *Python* 3.7 or above.
Up to now, **Linux** supports all four functions: audio classification, speech recognition, text-to-speech and speech translation; speech translation is not yet supported on **Mac OSX** and **Windows**. For installation details, please refer to the [installation documentation](./docs/source/install_cn.md).
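
A minimal sketch of the easy installation path (assumed here: pip on Linux with Python 3.7+; the medium and hard paths and exact version requirements are described in the installation documentation):
```bash
# Install the PaddlePaddle framework first, then PaddleSpeech itself
pip install paddlepaddle
pip install paddlespeech
```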
<a name="快速开始"></a>
## Quick Start

After installation, developers can get started quickly from the command line; change `--input` to test with your own audio or text.
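
For example, a quick sketch (assuming `zh.wav` is a 16 kHz Mandarin test clip in the current directory):
```bash
# Speech recognition
paddlespeech asr --lang zh --input ./zh.wav
# Text-to-speech: write the synthesized audio to output.wav
paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --output output.wav
```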
......@@ -257,7 +227,7 @@ paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
For more command-line examples, please refer to the [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos).
> Note: If you need to train or fine-tune models, please see the [speech recognition](./docs/source/asr/quick_start.md) and [text-to-speech](./docs/source/tts/quick_start.md) quick-start guides.
<a name="快速使用服务"></a>
## Quick Start Server

After installation, developers can quickly use the server from the command line.
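
For example, a minimal sketch of starting the server and calling it from a client (the config path below is the default one shipped with the repo; adjust it to your own setup):
```bash
# Start the server; the engines it loads are listed in the yaml config
paddlespeech_server start --config_file ./paddlespeech/server/conf/application.yaml
# In another shell, send an audio classification request
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
```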
......@@ -283,30 +253,30 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
For more command-line usage of the services, please refer to the [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server).
<a name="快速使用流式服务"></a>
## Quick Start Streaming Server

Developers can try the [streaming ASR](./demos/streaming_asr_server/README.md) and [streaming TTS](./demos/streaming_tts_server/README.md) services.

**Start the streaming ASR server**
```
paddlespeech_server start --config_file ./demos/streaming_asr_server/conf/application.yaml
```
**Access the streaming ASR server**
```
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav
```
**Start the streaming TTS server**
```
paddlespeech_server start --config_file ./demos/streaming_tts_server/conf/tts_online_application.yaml
```
**Access the streaming TTS server**
```
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
......@@ -314,8 +284,7 @@ paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http
For more information, see [streaming ASR](./demos/streaming_asr_server/README.md) and [streaming TTS](./demos/streaming_tts_server/README.md).
<a name="模型列表"></a>
## Model List

PaddleSpeech supports a wide range of mainstream models and provides pretrained models; see the [model list](./docs/source/released_model.md) for details.
......@@ -587,6 +556,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tbody>
</table>
<a name="教程文档"></a>
## Documents

For the tasks PaddleSpeech focuses on, the following guides help developers get started quickly and understand the core ideas of speech processing.
......@@ -668,7 +638,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
<a name="欢迎贡献"></a>
## Contributing to PaddleSpeech

You are warmly welcome to post questions in [Discussions](https://github.com/PaddlePaddle/PaddleSpeech/discussions) and report bugs in [Issues](https://github.com/PaddlePaddle/PaddleSpeech/issues). We also look forward to your contributions to PaddleSpeech!
### Contributors
<p align="center">
......
......@@ -16,7 +16,12 @@ see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/doc
You can choose either the medium or the hard way to install paddlespeech.

The dependencies are listed in requirements.txt; install them as follows:
```
pip install -r requirements.txt
```
### 2. Prepare Input File
The input of this demo should be a WAV file (`.wav`), and the sample rate must be the same as that of the model.
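
If your recording is in another format or sample rate, one common way to convert it is with `ffmpeg` (assumed to be installed; 16 kHz mono PCM is used here as the target, adjust the rate to whatever the chosen model expects):
```bash
# Convert an arbitrary audio file to a 16 kHz, mono, 16-bit PCM WAV
ffmpeg -i input.mp3 -ac 1 -ar 16000 -acodec pcm_s16le input_16k.wav
```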
......
......@@ -16,7 +16,11 @@
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
You can choose either the medium or the hard way to install.

The dependencies are listed in requirements.txt; install them as follows:
```
pip install -r requirements.txt
```

### 2. Prepare Input
The input of this demo should be a WAV file (`.wav`), and the sample rate must be the same as that of the model.
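
If your recording does not match the model's sample rate, one way to convert it is with `ffmpeg` (assumed to be installed; 16 kHz mono is used here as an example target):
```bash
# Resample to 16 kHz mono, 16-bit PCM WAV
ffmpeg -i input.mp3 -ac 1 -ar 16000 -acodec pcm_s16le input_16k.wav
```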
......
......@@ -28,6 +28,7 @@ acs_python:
word_list: "./conf/words.txt"
sample_rate: 16000
device: 'cpu' # set 'gpu:id' or 'cpu'
ping_timeout: 100 # seconds
......
websocket-client
\ No newline at end of file
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from paddlespeech.cli.log import logger
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor
if __name__ == "__main__":
parser = argparse.ArgumentParser(
prog='paddlespeech_server.start', add_help=True)
parser.add_argument(
"--config_file",
action="store",
help="yaml file of the app",
default=None,
required=True)
parser.add_argument(
"--log_file",
action="store",
help="log file",
default="./log/paddlespeech.log")
logger.info("start to parse the args")
args = parser.parse_args()
logger.info("start to launch the streaming asr server")
streaming_asr_server = ServerExecutor()
streaming_asr_server(config_file=args.config_file, log_file=args.log_file)
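
A usage sketch for this launcher (the script name `streaming_asr_server.py` is assumed here; the config path is the one used by the streaming ASR demo above):
```bash
python streaming_asr_server.py \
    --config_file ./demos/streaming_asr_server/conf/application.yaml \
    --log_file ./log/paddlespeech.log
```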
......@@ -26,8 +26,7 @@ def get_audios(path):
"""
supported_formats = [".wav", ".mp3", ".ogg", ".flac", ".m4a"]
return [
item for sublist in [[os.path.join(dir, file) for file in files]
for dir, _, files in list(os.walk(path))]
for item in sublist if os.path.splitext(item)[1] in supported_formats
]
......
......@@ -53,50 +53,49 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
demo [ -1.3251206 7.8606825 -4.620626 0.3000721 2.2648535
-1.1931441 3.0647137 7.673595 -6.0044727 -12.02426
-1.9496069 3.1269536 1.618838 -7.6383104 -1.2299773
-12.338331 2.1373026 -5.3957124 9.717328 5.6752305
3.7805123 3.0597172 3.429692 8.97601 13.174125
-0.53132284 8.9424715 4.46511 -4.4262476 -9.726503
8.399328 7.2239175 -7.435854 2.9441683 -4.3430395
-13.886965 -1.6346735 -10.9027405 -5.311245 3.8007221
3.8976038 -2.1230774 -2.3521194 4.151031 -7.4048667
0.13911647 2.4626107 4.9664545 0.9897574 5.4839754
-3.3574002 10.1340065 -0.6120171 -10.403095 4.6007543
16.00935 -7.7836914 -4.1945305 -6.9368606 1.1789556
11.490801 4.2380238 9.550931 8.375046 7.5089145
-0.65707296 -0.30051577 2.8406055 3.0828028 0.730817
6.148354 0.13766119 -13.424735 -7.7461405 -2.3227983
-8.305252 2.9879124 -10.995229 0.15211068 -2.3820348
-1.7984174 8.495629 -5.8522367 -3.755498 0.6989711
-5.2702994 -2.6188622 -1.8828466 -4.64665 14.078544
-0.5495333 10.579158 -3.2160501 9.349004 -4.381078
-11.675817 -2.8630207 4.5721755 2.246612 -4.574342
1.8610188 2.3767874 5.6257877 -9.784078 0.64967257
-1.4579505 0.4263264 -4.9211264 -2.454784 3.4869802
-0.42654222 8.341269 1.356552 7.0966883 -13.102829
8.016734 -7.1159344 1.8699781 0.208721 14.699384
-1.025278 -2.6107233 -2.5082312 8.427193 6.9138527
-6.2912464 0.6157366 2.489688 -3.4668267 9.921763
11.200815 -0.1966403 7.4916005 -0.62312716 -0.25848144
-9.947997 -0.9611041 1.1649219 -2.1907122 -1.5028487
-0.51926106 15.165954 2.4649463 -0.9980445 7.4416637
-2.0768049 3.5896823 -7.3055434 -7.5620847 4.323335
0.0804418 -6.56401 -2.3148053 -1.7642345 -2.4708817
-7.675618 -9.548878 -1.0177554 0.16986446 2.5877135
-1.8752296 -0.36614323 -6.0493784 -2.3965611 -5.9453387
0.9424033 -13.155974 -7.457801 0.14658108 -3.742797
5.8414927 -1.2872906 5.5694313 12.57059 1.0939219
2.2142086 1.9181576 6.9914207 -5.888139 3.1409824
-2.003628 2.4434285 9.973139 5.03668 2.0051203
2.8615603 5.860224 2.9176188 -1.6311141 2.0292206
-4.070415 -6.831437 ]
```
- Python API
```python
import paddle
from paddlespeech.cli import VectorExecutor
vector_executor = VectorExecutor()
......@@ -128,88 +127,88 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```bash
# Vector Result:
Audio embedding Result:
[ -1.3251206 7.8606825 -4.620626 0.3000721 2.2648535
-1.1931441 3.0647137 7.673595 -6.0044727 -12.02426
-1.9496069 3.1269536 1.618838 -7.6383104 -1.2299773
-12.338331 2.1373026 -5.3957124 9.717328 5.6752305
3.7805123 3.0597172 3.429692 8.97601 13.174125
-0.53132284 8.9424715 4.46511 -4.4262476 -9.726503
8.399328 7.2239175 -7.435854 2.9441683 -4.3430395
-13.886965 -1.6346735 -10.9027405 -5.311245 3.8007221
3.8976038 -2.1230774 -2.3521194 4.151031 -7.4048667
0.13911647 2.4626107 4.9664545 0.9897574 5.4839754
-3.3574002 10.1340065 -0.6120171 -10.403095 4.6007543
16.00935 -7.7836914 -4.1945305 -6.9368606 1.1789556
11.490801 4.2380238 9.550931 8.375046 7.5089145
-0.65707296 -0.30051577 2.8406055 3.0828028 0.730817
6.148354 0.13766119 -13.424735 -7.7461405 -2.3227983
-8.305252 2.9879124 -10.995229 0.15211068 -2.3820348
-1.7984174 8.495629 -5.8522367 -3.755498 0.6989711
-5.2702994 -2.6188622 -1.8828466 -4.64665 14.078544
-0.5495333 10.579158 -3.2160501 9.349004 -4.381078
-11.675817 -2.8630207 4.5721755 2.246612 -4.574342
1.8610188 2.3767874 5.6257877 -9.784078 0.64967257
-1.4579505 0.4263264 -4.9211264 -2.454784 3.4869802
-0.42654222 8.341269 1.356552 7.0966883 -13.102829
8.016734 -7.1159344 1.8699781 0.208721 14.699384
-1.025278 -2.6107233 -2.5082312 8.427193 6.9138527
-6.2912464 0.6157366 2.489688 -3.4668267 9.921763
11.200815 -0.1966403 7.4916005 -0.62312716 -0.25848144
-9.947997 -0.9611041 1.1649219 -2.1907122 -1.5028487
-0.51926106 15.165954 2.4649463 -0.9980445 7.4416637
-2.0768049 3.5896823 -7.3055434 -7.5620847 4.323335
0.0804418 -6.56401 -2.3148053 -1.7642345 -2.4708817
-7.675618 -9.548878 -1.0177554 0.16986446 2.5877135
-1.8752296 -0.36614323 -6.0493784 -2.3965611 -5.9453387
0.9424033 -13.155974 -7.457801 0.14658108 -3.742797
5.8414927 -1.2872906 5.5694313 12.57059 1.0939219
2.2142086 1.9181576 6.9914207 -5.888139 3.1409824
-2.003628 2.4434285 9.973139 5.03668 2.0051203
2.8615603 5.860224 2.9176188 -1.6311141 2.0292206
-4.070415 -6.831437 ]
# get the test embedding
Test embedding Result:
[ 2.5247195 5.119042 -4.335273 4.4583654 5.047907
3.5059214 1.6159848 0.49364898 -11.6899185 -3.1014526
-5.6589785 -0.42684984 2.674276 -11.937654 6.2248464
-10.776924 -5.694543 1.112041 1.5709964 1.0961034
1.3976512 2.324352 1.339981 5.279319 13.734659
-2.5753925 13.651442 -2.2357535 5.1575427 -3.251567
1.4023279 6.1191974 -6.0845175 -1.3646189 -2.6789894
-15.220778 9.779349 -9.411551 -6.388947 6.8313975
-9.245996 0.31196198 2.5509644 -4.413065 6.1649427
6.793837 2.6328635 8.620976 3.4832475 0.52491665
2.9115407 5.8392377 0.6702376 -3.2726715 2.6694255
16.91701 -5.5811176 0.23362345 -4.5573606 -11.801059
14.728292 -0.5198082 -3.999922 7.0927105 -7.0459595
-5.4389 -0.46420583 -5.1085467 10.376568 -8.889225
-0.37705845 -1.659806 2.6731026 -7.1909504 1.4608804
-2.163136 -0.17949677 4.0241547 0.11319201 0.601279
2.039692 3.1910992 -11.649526 -8.121584 -4.8707457
0.3851982 1.4231744 -2.3321972 0.99332285 14.121717
5.899413 0.7384519 -17.760096 10.555021 4.1366534
-0.3391071 -0.20792882 3.208204 0.8847948 -8.721497
-6.432868 13.006379 4.8956 -9.155822 -1.9441519
5.7815638 -2.066733 10.425042 -0.8802383 -2.4314315
-9.869258 0.35095334 -5.3549943 2.1076174 -8.290468
8.4433365 -4.689333 9.334139 -2.172678 -3.0250976
8.394216 -3.2110903 -7.93868 2.3960824 -2.3213403
-1.4963245 -3.476059 4.132903 -10.893354 4.362673
-0.45456508 10.258634 -1.1655927 -6.7799754 0.22885278
-4.399287 2.333433 -4.84745 -4.2752337 -1.3577863
-1.0685898 9.505196 7.3062205 0.08708266 12.927811
-9.57974 1.3936648 -1.9444873 5.776769 15.251903
10.6118355 -1.4903594 -9.535318 -3.6553776 -1.6699586
-0.5933151 7.600357 -4.8815503 -8.698617 -15.855757
0.25632986 -7.2235737 0.9506656 0.7128582 -9.051738
8.74869 -1.6426028 -6.5762258 2.506905 -6.7431564
5.129912 -12.189555 -3.6435068 12.068113 -6.0059533
-2.3535995 2.9014351 22.3082 -1.5563312 13.193291
2.7583609 -7.468798 1.3407065 -4.599617 -6.2345777
10.7689295 7.137627 5.099476 0.3473359 9.647881
-2.0484571 -5.8549366 ]
# get the score between enroll and test
Eembeddings Score: 0.45332613587379456
```
### 4. Pretrained Models
......
......@@ -51,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
[ -1.3251206 7.8606825 -4.620626 0.3000721 2.2648535
-1.1931441 3.0647137 7.673595 -6.0044727 -12.02426
-1.9496069 3.1269536 1.618838 -7.6383104 -1.2299773
-12.338331 2.1373026 -5.3957124 9.717328 5.6752305
3.7805123 3.0597172 3.429692 8.97601 13.174125
-0.53132284 8.9424715 4.46511 -4.4262476 -9.726503
8.399328 7.2239175 -7.435854 2.9441683 -4.3430395
-13.886965 -1.6346735 -10.9027405 -5.311245 3.8007221
3.8976038 -2.1230774 -2.3521194 4.151031 -7.4048667
0.13911647 2.4626107 4.9664545 0.9897574 5.4839754
-3.3574002 10.1340065 -0.6120171 -10.403095 4.6007543
16.00935 -7.7836914 -4.1945305 -6.9368606 1.1789556
11.490801 4.2380238 9.550931 8.375046 7.5089145
-0.65707296 -0.30051577 2.8406055 3.0828028 0.730817
6.148354 0.13766119 -13.424735 -7.7461405 -2.3227983
-8.305252 2.9879124 -10.995229 0.15211068 -2.3820348
-1.7984174 8.495629 -5.8522367 -3.755498 0.6989711
-5.2702994 -2.6188622 -1.8828466 -4.64665 14.078544
-0.5495333 10.579158 -3.2160501 9.349004 -4.381078
-11.675817 -2.8630207 4.5721755 2.246612 -4.574342
1.8610188 2.3767874 5.6257877 -9.784078 0.64967257
-1.4579505 0.4263264 -4.9211264 -2.454784 3.4869802
-0.42654222 8.341269 1.356552 7.0966883 -13.102829
8.016734 -7.1159344 1.8699781 0.208721 14.699384
-1.025278 -2.6107233 -2.5082312 8.427193 6.9138527
-6.2912464 0.6157366 2.489688 -3.4668267 9.921763
11.200815 -0.1966403 7.4916005 -0.62312716 -0.25848144
-9.947997 -0.9611041 1.1649219 -2.1907122 -1.5028487
-0.51926106 15.165954 2.4649463 -0.9980445 7.4416637
-2.0768049 3.5896823 -7.3055434 -7.5620847 4.323335
0.0804418 -6.56401 -2.3148053 -1.7642345 -2.4708817
-7.675618 -9.548878 -1.0177554 0.16986446 2.5877135
-1.8752296 -0.36614323 -6.0493784 -2.3965611 -5.9453387
0.9424033 -13.155974 -7.457801 0.14658108 -3.742797
5.8414927 -1.2872906 5.5694313 12.57059 1.0939219
2.2142086 1.9181576 6.9914207 -5.888139 3.1409824
-2.003628 2.4434285 9.973139 5.03668 2.0051203
2.8615603 5.860224 2.9176188 -1.6311141 2.0292206
-4.070415 -6.831437 ]
```
- Python API
......@@ -125,88 +125,88 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```bash
# Vector Result:
Audio embedding Result:
[ -1.3251206 7.8606825 -4.620626 0.3000721 2.2648535
-1.1931441 3.0647137 7.673595 -6.0044727 -12.02426
-1.9496069 3.1269536 1.618838 -7.6383104 -1.2299773
-12.338331 2.1373026 -5.3957124 9.717328 5.6752305
3.7805123 3.0597172 3.429692 8.97601 13.174125
-0.53132284 8.9424715 4.46511 -4.4262476 -9.726503
8.399328 7.2239175 -7.435854 2.9441683 -4.3430395
-13.886965 -1.6346735 -10.9027405 -5.311245 3.8007221
3.8976038 -2.1230774 -2.3521194 4.151031 -7.4048667
0.13911647 2.4626107 4.9664545 0.9897574 5.4839754
-3.3574002 10.1340065 -0.6120171 -10.403095 4.6007543
16.00935 -7.7836914 -4.1945305 -6.9368606 1.1789556
11.490801 4.2380238 9.550931 8.375046 7.5089145
-0.65707296 -0.30051577 2.8406055 3.0828028 0.730817
6.148354 0.13766119 -13.424735 -7.7461405 -2.3227983
-8.305252 2.9879124 -10.995229 0.15211068 -2.3820348
-1.7984174 8.495629 -5.8522367 -3.755498 0.6989711
-5.2702994 -2.6188622 -1.8828466 -4.64665 14.078544
-0.5495333 10.579158 -3.2160501 9.349004 -4.381078
-11.675817 -2.8630207 4.5721755 2.246612 -4.574342
1.8610188 2.3767874 5.6257877 -9.784078 0.64967257
-1.4579505 0.4263264 -4.9211264 -2.454784 3.4869802
-0.42654222 8.341269 1.356552 7.0966883 -13.102829
8.016734 -7.1159344 1.8699781 0.208721 14.699384
-1.025278 -2.6107233 -2.5082312 8.427193 6.9138527
-6.2912464 0.6157366 2.489688 -3.4668267 9.921763
11.200815 -0.1966403 7.4916005 -0.62312716 -0.25848144
-9.947997 -0.9611041 1.1649219 -2.1907122 -1.5028487
-0.51926106 15.165954 2.4649463 -0.9980445 7.4416637
-2.0768049 3.5896823 -7.3055434 -7.5620847 4.323335
0.0804418 -6.56401 -2.3148053 -1.7642345 -2.4708817
-7.675618 -9.548878 -1.0177554 0.16986446 2.5877135
-1.8752296 -0.36614323 -6.0493784 -2.3965611 -5.9453387
0.9424033 -13.155974 -7.457801 0.14658108 -3.742797
5.8414927 -1.2872906 5.5694313 12.57059 1.0939219
2.2142086 1.9181576 6.9914207 -5.888139 3.1409824
-2.003628 2.4434285 9.973139 5.03668 2.0051203
2.8615603 5.860224 2.9176188 -1.6311141 2.0292206
-4.070415 -6.831437 ]
# get the test embedding
Test embedding Result:
[ 2.5247195 5.119042 -4.335273 4.4583654 5.047907
3.5059214 1.6159848 0.49364898 -11.6899185 -3.1014526
-5.6589785 -0.42684984 2.674276 -11.937654 6.2248464
-10.776924 -5.694543 1.112041 1.5709964 1.0961034
1.3976512 2.324352 1.339981 5.279319 13.734659
-2.5753925 13.651442 -2.2357535 5.1575427 -3.251567
1.4023279 6.1191974 -6.0845175 -1.3646189 -2.6789894
-15.220778 9.779349 -9.411551 -6.388947 6.8313975
-9.245996 0.31196198 2.5509644 -4.413065 6.1649427
6.793837 2.6328635 8.620976 3.4832475 0.52491665
2.9115407 5.8392377 0.6702376 -3.2726715 2.6694255
16.91701 -5.5811176 0.23362345 -4.5573606 -11.801059
14.728292 -0.5198082 -3.999922 7.0927105 -7.0459595
-5.4389 -0.46420583 -5.1085467 10.376568 -8.889225
-0.37705845 -1.659806 2.6731026 -7.1909504 1.4608804
-2.163136 -0.17949677 4.0241547 0.11319201 0.601279
2.039692 3.1910992 -11.649526 -8.121584 -4.8707457
0.3851982 1.4231744 -2.3321972 0.99332285 14.121717
5.899413 0.7384519 -17.760096 10.555021 4.1366534
-0.3391071 -0.20792882 3.208204 0.8847948 -8.721497
-6.432868 13.006379 4.8956 -9.155822 -1.9441519
5.7815638 -2.066733 10.425042 -0.8802383 -2.4314315
-9.869258 0.35095334 -5.3549943 2.1076174 -8.290468
8.4433365 -4.689333 9.334139 -2.172678 -3.0250976
8.394216 -3.2110903 -7.93868 2.3960824 -2.3213403
-1.4963245 -3.476059 4.132903 -10.893354 4.362673
-0.45456508 10.258634 -1.1655927 -6.7799754 0.22885278
-4.399287 2.333433 -4.84745 -4.2752337 -1.3577863
-1.0685898 9.505196 7.3062205 0.08708266 12.927811
-9.57974 1.3936648 -1.9444873 5.776769 15.251903
10.6118355 -1.4903594 -9.535318 -3.6553776 -1.6699586
-0.5933151 7.600357 -4.8815503 -8.698617 -15.855757
0.25632986 -7.2235737 0.9506656 0.7128582 -9.051738
8.74869 -1.6426028 -6.5762258 2.506905 -6.7431564
5.129912 -12.189555 -3.6435068 12.068113 -6.0059533
-2.3535995 2.9014351 22.3082 -1.5563312 13.193291
2.7583609 -7.468798 1.3407065 -4.599617 -6.2345777
10.7689295 7.137627 5.099476 0.3473359 9.647881
-2.0484571 -5.8549366 ]
# get the score between enroll and test
Eembeddings Score: 0.45332613587379456
```
### 4. Pretrained Models
......
......@@ -274,12 +274,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```bash
[2022-05-25 12:25:36,165] [ INFO] - vector http client start
[2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav
[2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector
[2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector
[2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, 
-1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}}
[2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s.
```
* Python API
......@@ -299,7 +299,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 
12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}}
```
#### 7.2 Get the score between two speaker audio embeddings
......@@ -331,12 +331,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
[2022-05-25 12:33:24,527] [ INFO] - vector score http client start
[2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
[2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
[2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
[2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s.
```
* Python API
......@@ -358,10 +358,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
[2022-05-25 12:30:14,143] [ INFO] - vector score http client start
[2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
[2022-05-25 12:30:14,363] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
```
### 8. Punctuation prediction
......
......@@ -277,12 +277,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
[2022-05-25 12:25:36,165] [ INFO] - vector http client start
[2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav
[2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector
[2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector
[2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, 
-1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}}
[2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s.
```
* Python API
......@@ -302,7 +302,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [1.421751856803894, 5.626245498657227, -5.342077255249023, 1.1773887872695923, 3.3080549240112305, 1.7565933465957642, 5.167886257171631, 10.806358337402344, -3.8226819038391113, -5.614140033721924, 2.6238479614257812, -0.8072972893714905, 1.9635076522827148, -7.312870025634766, 0.011035939678549767, -9.723129272460938, 0.6619706153869629, -6.976806163787842, 10.213476181030273, 7.494769096374512, 2.9105682373046875, 3.8949244022369385, 3.799983501434326, 7.106168746948242, 16.90532875061035, -7.149388313293457, 8.733108520507812, 3.423006296157837, -4.831653594970703, -11.403363227844238, 11.232224464416504, 7.127461910247803, -4.282842636108398, 2.452359437942505, -5.130749702453613, -18.17766761779785, -2.6116831302642822, -11.000344276428223, -6.731433391571045, 1.6564682722091675, 0.7618281245231628, 1.125300407409668, -2.0838370323181152, 4.725743293762207, -8.782588005065918, -3.5398752689361572, 3.8142364025115967, 5.142068862915039, 2.1620609760284424, 4.09643030166626, -6.416214942932129, 12.747446060180664, 1.9429892301559448, -15.15294361114502, 6.417416095733643, 16.09701156616211, -9.716667175292969, -1.9920575618743896, -3.36494779586792, -1.8719440698623657, 11.567351341247559, 3.6978814601898193, 11.258262634277344, 7.442368507385254, 9.183408737182617, 4.528149127960205, -1.2417854070663452, 4.395912170410156, 6.6727728843688965, 5.88988733291626, 7.627128601074219, -0.6691966652870178, -11.889698028564453, -9.20886516571045, -7.42740535736084, -3.777663230895996, 6.917238712310791, -9.848755836486816, -2.0944676399230957, -5.1351165771484375, 0.4956451654434204, 9.317537307739258, -5.914181232452393, -1.809860348701477, -0.11738915741443634, -7.1692705154418945, -1.057827353477478, -5.721670627593994, -5.117385387420654, 16.13765525817871, -4.473617076873779, 7.6624321937561035, -0.55381840467453, 9.631585121154785, -6.470459461212158, -8.548508644104004, 4.371616840362549, -0.7970245480537415, 4.4789886474609375, -2.975860834121704, 3.2721822261810303, 2.838287830352783, 5.134591102600098, -9.19079875946045, -0.5657302737236023, -4.8745832443237305, 2.3165574073791504, -5.984319686889648, -2.1798853874206543, 0.3554139733314514, -0.3178512752056122, 9.493552207946777, 2.1144471168518066, 4.358094692230225, -12.089824676513672, 8.451693534851074, -7.925466537475586, 4.624246597290039, 4.428936958312988, 18.69200897216797, -2.6204581260681152, -5.14918851852417, -0.3582090139389038, 8.488558769226074, 4.98148775100708, -9.326835632324219, -2.2544219493865967, 6.641760349273682, 1.2119598388671875, 10.977124214172363, 16.555034637451172, 3.3238420486450195, 9.551861763000488, -1.6676981449127197, -0.7953944206237793, -8.605667114257812, -0.4735655188560486, 2.674196243286133, -5.359177112579346, -2.66738224029541, 0.6660683155059814, 15.44322681427002, 4.740593433380127, -3.472534418106079, 11.592567443847656, -2.0544962882995605, 1.736127495765686, -8.265326499938965, -9.30447769165039, 5.406829833984375, -1.518022894859314, -7.746612548828125, -6.089611053466797, 0.07112743705511093, -0.3490503430366516, -8.64989185333252, -9.998957633972168, -2.564845085144043, -0.5399947762489319, 2.6018123626708984, -0.3192799389362335, -1.8815255165100098, -2.0721492767333984, -3.410574436187744, -8.29980754852295, 1.483638048171997, -15.365986824035645, -8.288211822509766, 3.884779930114746, -3.4876468181610107, 7.362999439239502, 0.4657334089279175, 3.1326050758361816, 12.438895225524902, 
-1.8337041139602661, 4.532927989959717, 2.7264339923858643, 10.14534854888916, -6.521963596343994, 2.897155523300171, -3.392582654953003, 5.079153060913086, 7.7597246170043945, 4.677570819854736, 5.845779895782471, 2.402411460876465, 7.7071051597595215, 3.9711380004882812, -6.39003849029541, 6.12687873840332, -3.776029348373413, -11.118121147155762]}}
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 
12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}}
```
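The `result['vec']` field in the responses above is a fixed-length speaker embedding (192-dimensional for the ECAPA-TDNN release listed later in this commit). Below is a minimal sketch of inspecting such a response with NumPy; `resp` is a shortened stand-in for one of the dicts printed above, not an object returned by any PaddleSpeech call.

```python
import numpy as np

# Shortened stand-in for one of the response dicts printed above.
resp = {"success": True, "code": 200, "result": {"vec": [1.4218, 5.6262, -5.3421]}}

emb = np.asarray(resp["result"]["vec"], dtype=np.float32)
print(emb.shape)                   # (192,) for the full ECAPA-TDNN embedding
print(float(np.linalg.norm(emb)))  # embeddings are typically L2-normalized before comparison
```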
#### 7.2 Speaker Verification Scoring
......@@ -333,12 +333,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
[2022-05-09 10:28:40,556] [ INFO] - vector score http client start
[2022-05-09 10:28:40,556] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-09 10:28:40,556] [ INFO] - endpoint: http://127.0.0.1:8090/paddlespeech/vector/score
[2022-05-09 10:28:40,731] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.4292638897895813}}
[2022-05-09 10:28:40,731] [ INFO] - The vector: None
[2022-05-09 10:28:40,731] [ INFO] - Response time 0.175514 s.
[2022-05-25 12:33:24,527] [ INFO] - vector score http client start
[2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
[2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
[2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
[2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s.
```
* Python API
......@@ -360,10 +360,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
``` bash
[2022-05-09 10:34:54,769] [ INFO] - vector score http client start
[2022-05-09 10:34:54,771] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-09 10:34:54,771] [ INFO] - endpoint: http://127.0.0.1:8590/paddlespeech/vector/score
[2022-05-09 10:34:55,026] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.4292638897895813}}
[2022-05-25 12:30:14,143] [ INFO] - vector score http client start
[2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
[2022-05-25 12:30:14,363] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
```
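The results table later in this commit reports the released ECAPA-TDNN model with a cosine back-end, so a score such as 0.4533 above can be read as the cosine similarity between the enrollment and test embeddings. A small illustrative sketch with toy vectors (not the PaddleSpeech implementation):

```python
import numpy as np

def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between two speaker embeddings given as lists of floats."""
    a = np.asarray(enroll_vec, dtype=np.float32)
    b = np.asarray(test_vec, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors; in practice these would be the `result['vec']` fields returned
# for 85236145389.wav (enroll) and 123456789.wav (test).
print(cosine_score([1.0, 2.0, 3.0], [1.5, 1.8, 2.9]))
```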
......
......@@ -4,7 +4,7 @@
# SERVER SETTING #
#################################################################################
host: 0.0.0.0
port: 8090
port: 8091
# The task format in the engin_list is: <speech task>_<engine type>
# task choices = ['asr_online']
......
......@@ -13,9 +13,7 @@
# limitations under the License.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# script for calc RTF: grep -rn RTF log.txt | awk '{print $NF}' | awk -F "=" '{sum += $NF} END {print "all time",sum, "audio num", NR, "RTF", sum/NR}'
import argparse
import asyncio
import codecs
......
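The RTF comment in the hunk above documents a grep/awk one-liner that averages `RTF=<value>` entries from a log file. For reference, a rough Python equivalent of the same calculation; the `RTF=<float>` log format is assumed from that command, so treat this as a sketch rather than part of the repo.

```python
import re

def average_rtf(log_path: str) -> float:
    """Average the RTF=<float> values found in a log, mirroring the awk one-liner."""
    values = []
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            match = re.search(r"RTF=([\d.]+)", line)
            if match:
                values.append(float(match.group(1)))
    return sum(values) / len(values) if values else float("nan")

# print(average_rtf("log.txt"))
```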
......@@ -92,5 +92,3 @@ server 的 demo: [streaming_asr_server](https://github.com/PaddlePaddle/Paddle
## 4. Quick Start
For how to use PP-ASR, see the [install](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md) guide, which provides three installation methods: **Easy**, **Medium**, and **Hard**. If you only want to try the inference features of paddlespeech, the **Easy** installation is enough.
......@@ -4,7 +4,7 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
| Way | Function | Support|
|:---- |:----------------------------------------------------------- |:----|
| Easy | (1) Use the command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac (M1 chip not supported), Windows |
| Easy | (1) Use the command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac (M1 chip not supported), Windows (for installation details, see [#1195](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1195)) |
| Medium | Supports major functions, such as using the `ready-made` examples and using PaddleSpeech to train your own model. | Linux |
| Hard | Supports the full function of PaddleSpeech, including joint CTC decoding with Kaldi, n-gram language model training, Montreal-Forced-Aligner, and so on. And you are well on your way to becoming a developer! | Ubuntu |
......
......@@ -3,7 +3,7 @@
There are three ways to install `PaddleSpeech`. According to the difficulty of installation, they can be classified as **Easy**, **Medium**, and **Hard**.
| Way | Function | Supported Systems |
| :--- | :----------------------------------------------------------- | :------------------ |
| Easy | (1) Use the command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac (M1 chip not supported), Windows |
| Easy | (1) Use the command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac (M1 chip not supported), Windows (for installation details, see [#1195](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1195)) |
| Medium | Supports the major functions of PaddleSpeech, such as using the models in the existing examples and training your own model with PaddleSpeech. | Linux |
| Hard | Supports all functions of PaddleSpeech, including joint CTC decoding with Kaldi, language model training, forced alignment, and so on. And you are well on your way to becoming a developer! | Ubuntu |
## Prerequisites
......
......@@ -82,7 +82,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1.tar.gz) | -
## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models
......
......@@ -6,15 +6,8 @@ AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpu
We use AISHELL-3 to train a multi-speaker fastspeech2 model here.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
......
......@@ -6,15 +6,8 @@ This example contains code used to train a [Tacotron2](https://arxiv.org/abs/171
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for Tacotron2; the MFA durations are not needed here.
You can download it from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
......
......@@ -6,15 +6,8 @@ This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
......
......@@ -4,15 +4,8 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
......
......@@ -4,15 +4,7 @@ This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA1.x) in our repo.
......
......@@ -26,4 +26,7 @@ Use the following command to run diarization on AMI corpus.
./run.sh --data_folder ./amicorpus --manual_annot_folder ./ami_public_manual_1.6.2
```
## Results (DER) coming soon! :)
## Best performance in terms of Diarization Error Rate (DER).
| System | Mic. |Orcl. (Dev)|Orcl. (Eval)| Est. (Dev) |Est. (Eval)|
| --------|-------- | ---------|----------- | --------|-----------|
| ECAPA-TDNN + SC | HeadsetMix| 1.54 % | 3.07 %| 1.56 %| 3.28 % |
......@@ -3,7 +3,7 @@ This example contains code used to train a [Tacotron2](https://arxiv.org/abs/171
## Dataset
### Download and Extract
Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/source).
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for Tacotron2; the MFA durations are not needed here.
......
......@@ -3,7 +3,7 @@ This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2
## Dataset
### Download and Extract
Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/source).
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH.
......
......@@ -4,7 +4,7 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2
## Dataset
### Download and Extract
Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/source).
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
......
......@@ -5,7 +5,7 @@
## Dataset
### Download and Extract
Download the dataset from the [Official Website](https://test.data-baker.com/data/index/source).
Download the dataset from the [Official Website](https://test.data-baker.com/data/index/TNtts/).
### Get MFA Results and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the phoneme durations for fastspeech2.
......
# This configuration was tested on 4 GPUs (V100) with 32GB GPU
# memory. It takes around 2 weeks to finish the training,
# but a model at 100k iterations should already generate reasonable results.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 22050 # sr
n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size (samples). 12.5ms
win_length: null # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
##########################################################
# TTS MODEL SETTING #
##########################################################
model:
# generator related
generator_type: vits_generator
generator_params:
hidden_channels: 192
spks: -1
global_channels: -1
segment_size: 32
text_encoder_attention_heads: 2
text_encoder_ffn_expand: 4
text_encoder_blocks: 6
text_encoder_positionwise_layer_type: "conv1d"
text_encoder_positionwise_conv_kernel_size: 3
text_encoder_positional_encoding_layer_type: "rel_pos"
text_encoder_self_attention_layer_type: "rel_selfattn"
text_encoder_activation_type: "swish"
text_encoder_normalize_before: True
text_encoder_dropout_rate: 0.1
text_encoder_positional_dropout_rate: 0.0
text_encoder_attention_dropout_rate: 0.1
use_macaron_style_in_text_encoder: True
use_conformer_conv_in_text_encoder: False
text_encoder_conformer_kernel_size: -1
decoder_kernel_size: 7
decoder_channels: 512
decoder_upsample_scales: [8, 8, 2, 2]
decoder_upsample_kernel_sizes: [16, 16, 4, 4]
decoder_resblock_kernel_sizes: [3, 7, 11]
decoder_resblock_dilations: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
use_weight_norm_in_decoder: True
posterior_encoder_kernel_size: 5
posterior_encoder_layers: 16
posterior_encoder_stacks: 1
posterior_encoder_base_dilation: 1
posterior_encoder_dropout_rate: 0.0
use_weight_norm_in_posterior_encoder: True
flow_flows: 4
flow_kernel_size: 5
flow_base_dilation: 1
flow_layers: 4
flow_dropout_rate: 0.0
use_weight_norm_in_flow: True
use_only_mean_in_flow: True
stochastic_duration_predictor_kernel_size: 3
stochastic_duration_predictor_dropout_rate: 0.5
stochastic_duration_predictor_flows: 4
stochastic_duration_predictor_dds_conv_layers: 3
# discriminator related
discriminator_type: hifigan_multi_scale_multi_period_discriminator
discriminator_params:
scales: 1
scale_downsample_pooling: "AvgPool1D"
scale_downsample_pooling_params:
kernel_size: 4
stride: 2
padding: 2
scale_discriminator_params:
in_channels: 1
out_channels: 1
kernel_sizes: [15, 41, 5, 3]
channels: 128
max_downsample_channels: 1024
max_groups: 16
bias: True
downsample_scales: [2, 2, 4, 4, 1]
nonlinear_activation: "leakyrelu"
nonlinear_activation_params:
negative_slope: 0.1
use_weight_norm: True
use_spectral_norm: False
follow_official_norm: False
periods: [2, 3, 5, 7, 11]
period_discriminator_params:
in_channels: 1
out_channels: 1
kernel_sizes: [5, 3]
channels: 32
downsample_scales: [3, 3, 3, 3, 1]
max_downsample_channels: 1024
bias: True
nonlinear_activation: "leakyrelu"
nonlinear_activation_params:
negative_slope: 0.1
use_weight_norm: True
use_spectral_norm: False
# others
sampling_rate: 22050 # needed in the inference for saving wav
cache_generator_outputs: True # whether to cache generator outputs in the training
###########################################################
# LOSS SETTING #
###########################################################
# loss function related
generator_adv_loss_params:
average_by_discriminators: False # whether to average loss value by #discriminators
loss_type: mse # loss type, "mse" or "hinge"
discriminator_adv_loss_params:
average_by_discriminators: False # whether to average loss value by #discriminators
loss_type: mse # loss type, "mse" or "hinge"
feat_match_loss_params:
average_by_discriminators: False # whether to average loss value by #discriminators
average_by_layers: False # whether to average loss value by #layers of each discriminator
include_final_outputs: True # whether to include final outputs for loss calculation
mel_loss_params:
fs: 22050 # must be the same as the training data
fft_size: 1024 # fft points
hop_size: 256 # hop size
win_length: null # window length
window: hann # window type
num_mels: 80 # number of Mel basis
fmin: 0 # minimum frequency for Mel basis
fmax: null # maximum frequency for Mel basis
log_base: null # null represent natural log
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_adv: 1.0 # loss scaling coefficient for adversarial loss
lambda_mel: 45.0 # loss scaling coefficient for Mel loss
lambda_feat_match: 2.0 # loss scaling coefficient for feat match loss
lambda_dur: 1.0 # loss scaling coefficient for duration loss
lambda_kl: 1.0 # loss scaling coefficient for KL divergence loss
# others
sampling_rate: 22050 # needed in the inference for saving wav
cache_generator_outputs: True # whether to cache generator outputs in the training
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 64 # Batch size.
num_workers: 4 # Number of workers in DataLoader.
##########################################################
# OPTIMIZER & SCHEDULER SETTING #
##########################################################
# optimizer setting for generator
generator_optimizer_params:
beta1: 0.8
beta2: 0.99
epsilon: 1.0e-9
weight_decay: 0.0
generator_scheduler: exponential_decay
generator_scheduler_params:
learning_rate: 2.0e-4
gamma: 0.999875
# optimizer setting for discriminator
discriminator_optimizer_params:
beta1: 0.8
beta2: 0.99
epsilon: 1.0e-9
weight_decay: 0.0
discriminator_scheduler: exponential_decay
discriminator_scheduler_params:
learning_rate: 2.0e-4
gamma: 0.999875
generator_first: False # whether to start updating generator first
##########################################################
# OTHER TRAINING SETTING #
##########################################################
max_epoch: 1000 # number of epochs
num_snapshots: 10 # max number of snapshots to keep while training
seed: 777 # random seed number
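For orientation, the VITS configuration above is plain YAML consumed by the training scripts that follow. Below is a minimal sketch of loading it in Python and reading a few of the fields defined above; the `conf/default.yaml` path matches the `conf_path` used in `run.sh` below, and this is only an illustration of the file layout, not the training entry point.

```python
import yaml

with open("conf/default.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

print(cfg["fs"], cfg["n_fft"], cfg["n_shift"])              # 22050 1024 256
print(cfg["model"]["generator_type"])                        # vits_generator
print(cfg["model"]["generator_params"]["hidden_channels"])   # 192
print(cfg["generator_optimizer_params"]["beta1"])            # 0.8
```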
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./baker_alignment_tone \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=baker \
--rootdir=~/datasets/BZNSYP/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats(mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--feats-stats=dump/train/feats_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt \
--skip-wav-copy
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--feats-stats=dump/train/feats_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt \
--skip-wav-copy
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--feats-stats=dump/train/feats_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt \
--skip-wav-copy
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--config=${config_path} \
--ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--phones_dict=dump/phone_id_map.txt \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test
fi
\ No newline at end of file
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=${config_path} \
--ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--phones_dict=dump/phone_id_map.txt \
--output_dir=${train_output_path}/test_e2e \
--text=${BIN_DIR}/../sentences.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=4 \
--phones-dict=dump/phone_id_map.txt
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=vits
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # synthesize_e2e; VITS is end-to-end, so no separate vocoder is used
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
......@@ -2,7 +2,7 @@
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
......
......@@ -2,7 +2,7 @@
This example contains code used to train a [Multi Band MelGAN](https://arxiv.org/abs/2005.05106) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
......
......@@ -2,7 +2,7 @@
This example contains code used to train a [Style MelGAN](https://arxiv.org/abs/2011.01557) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
......
......@@ -2,7 +2,7 @@
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
......
......@@ -2,7 +2,7 @@
This example contains code used to train a [WaveRNN](https://arxiv.org/abs/1802.08435) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
......
......@@ -3,7 +3,7 @@ This example contains code used to train a [Tacotron2](https://arxiv.org/abs/171
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/LJSpeech-1.1`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for Tacotron2; the MFA durations are not needed here.
......
# TransformerTTS with LJSpeech
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
### Download and Extract
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/LJSpeech-1.1`.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the dataset has been extracted to `~/datasets`, so its path is `~/datasets/LJSpeech-1.1`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
......
......@@ -3,7 +3,7 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/LJSpeech-1.1`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
......
# WaveFlow with LJSpeech
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
### Download and Extract
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/LJSpeech-1.1`.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the path to the Tacotron2 generated mels is `../tts0/output/test`.
......
......@@ -2,7 +2,7 @@
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/LJSpeech-1.1`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
......
......@@ -2,7 +2,7 @@
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
## Dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/LJSpeech-1.1`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
......
......@@ -3,7 +3,7 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2
## Dataset
### Download and Extract the dataset
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443).
Download VCTK-0.92 from its [Official Website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
......
......@@ -3,7 +3,7 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a
## Dataset
### Download and Extract
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`.
Download VCTK-0.92 from its [Official Website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
......
......@@ -3,7 +3,7 @@ This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.
## Dataset
### Download and Extract
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`.
Download VCTK-0.92 from its [Official Website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
......
......@@ -141,11 +141,11 @@ using the `tar` scripts to unpack the model and then you can use the script to t
For example:
```
wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz
tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz
wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1.tar.gz
tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1.tar.gz
source path.sh
# If you have processed the data and get the manifest file, you can skip the following 2 steps
CUDA_VISIBLE_DEVICES= bash ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_2/model/ conf/ecapa_tdnn.yaml
CUDA_VISIBLE_DEVICES= bash ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1/model/ conf/ecapa_tdnn.yaml
```
The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
......@@ -4,4 +4,4 @@
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
| --- | --- | --- | --- | --- | --- | --- | ---- |
| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 | 0.95 |
| ECAPA-TDNN | 85M | 0.2.1 | conf/ecapa_tdnn.yaml | 192 | test | 0.8188 | 0.7815|
......@@ -59,3 +59,11 @@ global_embedding_norm: True
embedding_mean_norm: True
embedding_std_norm: False
###########################################
# score-norm #
###########################################
score_norm: s-norm
cohort_size: 20000 # number of impostor utterances in the normalization cohort
n_train_snts: 400000 # used for normalization stats
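The `score_norm: s-norm` entries above enable symmetric score normalization: the raw trial score is normalized against cohort scores computed for both the enrollment and the test utterance, using the impostor cohort sized by `cohort_size`. A minimal illustrative sketch of the standard S-norm formula (toy inputs; not the PaddleSpeech implementation):

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization (S-norm) of a raw trial score.

    enroll_cohort_scores: scores of the enrollment embedding against the cohort.
    test_cohort_scores:   scores of the test embedding against the cohort.
    """
    e = np.asarray(enroll_cohort_scores, dtype=np.float64)
    t = np.asarray(test_cohort_scores, dtype=np.float64)
    z = (raw_score - e.mean()) / e.std()  # Z-norm term
    s = (raw_score - t.mean()) / t.std()  # T-norm term
    return 0.5 * (z + s)

print(s_norm(0.45, [0.05, -0.02, 0.10, 0.01], [0.03, 0.08, -0.04, 0.02]))
```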
......@@ -58,3 +58,10 @@ global_embedding_norm: True
embedding_mean_norm: True
embedding_std_norm: False
###########################################
# score-norm #
###########################################
score_norm: s-norm
cohort_size: 20000 # number of impostor utterances in the normalization cohort
n_train_snts: 400000 # used for normalization stats
......@@ -181,7 +181,7 @@ class ASRExecutor(BaseExecutor):
lm_url,
os.path.dirname(self.config.decode.lang_model_path), lm_md5)
elif "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
elif "conformer" in model_type or "transformer" in model_type:
self.config.spm_model_prefix = os.path.join(
self.res_path, self.config.spm_model_prefix)
self.text_feature = TextFeaturizer(
......@@ -205,7 +205,7 @@ class ASRExecutor(BaseExecutor):
self.model.set_state_dict(model_dict)
# compute the max len limit
if "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
if "conformer" in model_type or "transformer" in model_type:
# in transformer like model, we may use the subsample rate cnn network
subsample_rate = self.model.subsampling_rate()
frame_shift_ms = self.config.preprocess_config.process[0][
......@@ -242,7 +242,7 @@ class ASRExecutor(BaseExecutor):
self._inputs["audio_len"] = audio_len
logger.info(f"audio feat shape: {audio.shape}")
elif "conformer" in model_type or "transformer" in model_type or "wenetspeech" in model_type:
elif "conformer" in model_type or "transformer" in model_type:
logger.info("get the preprocess conf")
preprocess_conf = self.config.preprocess_config
preprocess_args = {"train": False}
......
......@@ -23,6 +23,7 @@ import paddle
import yaml
from paddleaudio import load
from paddleaudio.features import LogMelSpectrogram
from paddlespeech.utils.dynamic_import import dynamic_import
from ..executor import BaseExecutor
from ..log import logger
......@@ -30,7 +31,7 @@ from ..utils import cli_register
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
__all__ = ['CLSExecutor']
......
......@@ -86,7 +86,7 @@ def get_path_from_url(url,
str: a local path to save downloaded models & weights & datasets.
"""
from paddle.fluid.dygraph.parallel import ParallelEnv
from paddle.distributed import ParallelEnv
assert _is_url(url), "downloading from {} not a url".format(url)
# parse path after download to decompress under root_dir
......
......@@ -36,8 +36,8 @@ from .pretrained_models import kaldi_bins
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.s2t.utils.utility import UpdateConfig
from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ["STExecutor"]
......
......@@ -21,7 +21,6 @@ from typing import Union
import paddle
from ...s2t.utils.dynamic_import import dynamic_import
from ..executor import BaseExecutor
from ..log import logger
from ..utils import cli_register
......@@ -29,6 +28,7 @@ from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from .pretrained_models import tokenizer_alias
from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ['TextExecutor']
......
......@@ -32,10 +32,10 @@ from ..utils import cli_register
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ['TTSExecutor']
......
......@@ -24,11 +24,11 @@ from typing import Any
from typing import Dict
import paddle
import paddleaudio
import requests
import yaml
from paddle.framework import load
import paddleaudio
from . import download
from .entry import commands
try:
......
......@@ -32,7 +32,7 @@ from ..utils import cli_register
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.utils.dynamic_import import dynamic_import
from paddlespeech.vector.io.batch import feature_normalize
from paddlespeech.vector.modules.sid_model import SpeakerIdetification
......
......@@ -19,9 +19,9 @@ pretrained_models = {
# "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav"
"ecapatdnn_voxceleb12-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz',
'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1.tar.gz',
'md5':
'cc33023c54ab346cd318408f43fcaf95',
'67c7ff8885d5246bd16e0f5ac1cba99f',
'cfg_path':
'conf/model.yaml', # the yaml config path
'ckpt_path':
......
......@@ -22,7 +22,7 @@ from paddleaudio.features import LogMelSpectrogram
from paddleaudio.utils import logger
from paddlespeech.cls.models import SoundClassifier
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.utils.dynamic_import import dynamic_import
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
......
......@@ -21,7 +21,7 @@ from paddleaudio.utils import logger
from paddleaudio.utils import Timer
from paddlespeech.cls.models import SoundClassifier
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.utils.dynamic_import import dynamic_import
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
......
......@@ -37,6 +37,12 @@ if __name__ == "__main__":
"--export_path", type=str, help="path of the jit model to save")
parser.add_argument(
"--model_type", type=str, default='offline', help="offline/online")
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args()
print("model_type:{}".format(args.model_type))
print_arguments(args)
......
......@@ -37,6 +37,12 @@ if __name__ == "__main__":
# save asr result to
parser.add_argument(
"--result_file", type=str, help="path of save the asr result")
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args()
print_arguments(args, globals())
print("model_type:{}".format(args.model_type))
......
......@@ -40,6 +40,12 @@ if __name__ == "__main__":
"--export_path", type=str, help="path of the jit model to save")
parser.add_argument(
"--model_type", type=str, default='offline', help='offline/online')
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
parser.add_argument(
"--enable-auto-log", action="store_true", help="use auto log")
args = parser.parse_args()
......
......@@ -33,6 +33,12 @@ if __name__ == "__main__":
parser = default_argument_parser()
parser.add_argument(
"--model_type", type=str, default='offline', help='offline/online')
parser.add_argument(
'--nxpu',
type=int,
default=0,
choices=[0, 1],
help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args()
print("model_type:{}".format(args.model_type))
print_arguments(args, globals())
......
......@@ -51,7 +51,7 @@ def _batch_shuffle(indices, batch_size, epoch, clipped=False):
"""
rng = np.random.RandomState(epoch)
shift_len = rng.randint(0, batch_size - 1)
batch_indices = list(zip(*[iter(indices[shift_len:])] * batch_size))
batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size))
rng.shuffle(batch_indices)
batch_indices = [item for batch in batch_indices for item in batch]
assert clipped is False
......
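The `zip(*[iter(indices)] * batch_size)` expression in the hunk above is the standard Python idiom for grouping a flat list into fixed-size batches: repeating the same iterator `batch_size` times makes `zip()` pull consecutive items, and any trailing remainder is dropped. A small standalone illustration:

```python
indices = list(range(10))
batch_size = 3

# Repeating one iterator batch_size times makes zip() consume consecutive items.
batches = list(zip(*[iter(indices)] * batch_size))
print(batches)  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)] -- index 9 is dropped
```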
......@@ -112,7 +112,16 @@ class Trainer():
logger.info(f"Rank: {self.rank}/{self.world_size}")
# set device
paddle.set_device('gpu' if self.args.ngpu > 0 else 'cpu')
if self.args.ngpu == 0:
if self.args.nxpu == 0:
paddle.set_device('cpu')
else:
paddle.set_device('xpu')
elif self.args.ngpu > 0:
paddle.set_device("gpu")
else:
raise Exception("invalid device")
if self.parallel:
self.init_parallel()
......
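The `--nxpu` flag added in the hunks above and the Trainer branch here follow one convention: GPU if `ngpu > 0`, otherwise XPU if `nxpu > 0`, otherwise CPU. Below is a minimal sketch of that resolution as a standalone helper; the name `resolve_device` is illustrative and not part of the patch:

```python
# Minimal sketch of the ngpu/nxpu device convention used above (illustrative).
import argparse

import paddle


def resolve_device(ngpu: int, nxpu: int) -> str:
    """Mirror the Trainer logic: gpu wins, then xpu, then cpu."""
    if ngpu > 0:
        return "gpu"
    if ngpu == 0:
        # "xpu" requires a Paddle build with XPU (Kunlun) support.
        return "xpu" if nxpu > 0 else "cpu"
    raise ValueError("invalid device: ngpu must be >= 0")


parser = argparse.ArgumentParser()
parser.add_argument("--ngpu", type=int, default=1, help="if ngpu == 0, use cpu or xpu.")
parser.add_argument("--nxpu", type=int, default=0, choices=[0, 1],
                    help="if nxpu == 0 and ngpu == 0, use cpu.")
args = parser.parse_args()
paddle.set_device(resolve_device(args.ngpu, args.nxpu))
```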
......@@ -752,6 +752,7 @@ class VectorClientExecutor(BaseExecutor):
res = handler.run(enroll_audio, test_audio, audio_format,
sample_rate)
logger.info(f"The vector score is: {res}")
return res
else:
logger.error(f"Sorry, we have not support such task {task}")
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import hashlib
import os
import os.path as osp
import shutil
import subprocess
import tarfile
import time
import zipfile
import requests
from tqdm import tqdm
from paddlespeech.cli.log import logger
__all__ = ['get_path_from_url']
DOWNLOAD_RETRY_LIMIT = 3
def _is_url(path):
"""
Whether path is URL.
Args:
path (string): URL string or not.
"""
return path.startswith('http://') or path.startswith('https://')
def _map_path(url, root_dir):
# parse path after download under root_dir
fname = osp.split(url)[-1]
fpath = fname
return osp.join(root_dir, fpath)
def _get_unique_endpoints(trainer_endpoints):
# Sort so that the endpoint order is deterministic across cards (their environment variables may differ)
trainer_endpoints.sort()
ips = set()
unique_endpoints = set()
for endpoint in trainer_endpoints:
ip = endpoint.split(":")[0]
if ip in ips:
continue
ips.add(ip)
unique_endpoints.add(endpoint)
logger.info("unique_endpoints {}".format(unique_endpoints))
return unique_endpoints
def get_path_from_url(url,
root_dir,
md5sum=None,
check_exist=True,
decompress=True,
method='get'):
""" Download from given url to root_dir.
If the file or directory specified by url already exists under
root_dir, return its path directly; otherwise download it
from url, decompress it, and return the path.
Args:
url (str): download url
root_dir (str): root dir for downloading, it should be
WEIGHTS_HOME or DATASET_HOME
md5sum (str): md5 sum of download package
decompress (bool): decompress zip or tar file. Default is `True`
method (str): which download method to use. Support `wget` and `get`. Default is `get`.
Returns:
str: a local path to save downloaded models & weights & datasets.
"""
from paddle.fluid.dygraph.parallel import ParallelEnv
assert _is_url(url), "downloading from {} not a url".format(url)
# parse path after download to decompress under root_dir
fullpath = _map_path(url, root_dir)
# In multi-machine training, only one process per IP downloads the data;
# the other processes on the same machine simply wait for the file to appear.
unique_endpoints = _get_unique_endpoints(ParallelEnv().trainer_endpoints[:])
if osp.exists(fullpath) and check_exist and _md5check(fullpath, md5sum):
logger.info("Found {}".format(fullpath))
else:
if ParallelEnv().current_endpoint in unique_endpoints:
fullpath = _download(url, root_dir, md5sum, method=method)
else:
while not os.path.exists(fullpath):
time.sleep(1)
if ParallelEnv().current_endpoint in unique_endpoints:
if decompress and (tarfile.is_tarfile(fullpath) or
zipfile.is_zipfile(fullpath)):
fullpath = _decompress(fullpath)
return fullpath
def _get_download(url, fullname):
# using requests.get method
fname = osp.basename(fullname)
try:
req = requests.get(url, stream=True)
except Exception as e: # requests.exceptions.ConnectionError
logger.info("Downloading {} from {} failed with exception {}".format(
fname, url, str(e)))
return False
if req.status_code != 200:
raise RuntimeError("Downloading from {} failed with code "
"{}!".format(url, req.status_code))
# To protect against interrupted downloads, write to
# tmp_fullname first and move tmp_fullname to fullname
# once the download has finished.
tmp_fullname = fullname + "_tmp"
total_size = req.headers.get('content-length')
with open(tmp_fullname, 'wb') as f:
if total_size:
with tqdm(total=(int(total_size) + 1023) // 1024) as pbar:
for chunk in req.iter_content(chunk_size=1024):
f.write(chunk)
pbar.update(1)
else:
for chunk in req.iter_content(chunk_size=1024):
if chunk:
f.write(chunk)
shutil.move(tmp_fullname, fullname)
return fullname
def _wget_download(url, fullname):
# using wget to download url
tmp_fullname = fullname + "_tmp"
# --user-agent
command = 'wget -O {} -t {} {}'.format(tmp_fullname, DOWNLOAD_RETRY_LIMIT,
url)
subprc = subprocess.Popen(
command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
_ = subprc.communicate()
if subprc.returncode != 0:
raise RuntimeError(
'{} failed. Please make sure `wget` is installed or {} exists'.
format(command, url))
shutil.move(tmp_fullname, fullname)
return fullname
_download_methods = {
'get': _get_download,
'wget': _wget_download,
}
def _download(url, path, md5sum=None, method='get'):
"""
Download from url, save to path.
url (str): download url
path (str): download to given path
md5sum (str): md5 sum of download package
method (str): which download method to use. Support `wget` and `get`. Default is `get`.
"""
assert method in _download_methods, 'make sure `{}` implemented'.format(
method)
if not osp.exists(path):
os.makedirs(path)
fname = osp.split(url)[-1]
fullname = osp.join(path, fname)
retry_cnt = 0
logger.info("Downloading {} from {}".format(fname, url))
while not (osp.exists(fullname) and _md5check(fullname, md5sum)):
if retry_cnt < DOWNLOAD_RETRY_LIMIT:
retry_cnt += 1
else:
raise RuntimeError("Download from {} failed. "
"Retry limit reached".format(url))
if not _download_methods[method](url, fullname):
time.sleep(1)
continue
return fullname
def _md5check(fullname, md5sum=None):
if md5sum is None:
return True
logger.info("File {} md5 checking...".format(fullname))
md5 = hashlib.md5()
with open(fullname, 'rb') as f:
for chunk in iter(lambda: f.read(4096), b""):
md5.update(chunk)
calc_md5sum = md5.hexdigest()
if calc_md5sum != md5sum:
logger.info("File {} md5 check failed, {}(calc) != "
"{}(base)".format(fullname, calc_md5sum, md5sum))
return False
return True
def _decompress(fname):
"""
Decompress for zip and tar file
"""
logger.info("Decompressing {}...".format(fname))
# To protect against interrupted decompression, extract into a
# temporary fpath_tmp directory first; if decompression succeeds,
# move the extracted files to fpath, then delete fpath_tmp and
# remove the downloaded archive.
if tarfile.is_tarfile(fname):
uncompressed_path = _uncompress_file_tar(fname)
elif zipfile.is_zipfile(fname):
uncompressed_path = _uncompress_file_zip(fname)
else:
raise TypeError("Unsupport compress file type {}".format(fname))
return uncompressed_path
def _uncompress_file_zip(filepath):
files = zipfile.ZipFile(filepath, 'r')
file_list = files.namelist()
file_dir = os.path.dirname(filepath)
if _is_a_single_file(file_list):
rootpath = file_list[0]
uncompressed_path = os.path.join(file_dir, rootpath)
for item in file_list:
files.extract(item, file_dir)
elif _is_a_single_dir(file_list):
rootpath = os.path.splitext(file_list[0])[0].split(os.sep)[0]
uncompressed_path = os.path.join(file_dir, rootpath)
for item in file_list:
files.extract(item, file_dir)
else:
rootpath = os.path.splitext(filepath)[0].split(os.sep)[-1]
uncompressed_path = os.path.join(file_dir, rootpath)
if not os.path.exists(uncompressed_path):
os.makedirs(uncompressed_path)
for item in file_list:
files.extract(item, os.path.join(file_dir, rootpath))
files.close()
return uncompressed_path
def _uncompress_file_tar(filepath, mode="r:*"):
files = tarfile.open(filepath, mode)
file_list = files.getnames()
file_dir = os.path.dirname(filepath)
if _is_a_single_file(file_list):
rootpath = file_list[0]
uncompressed_path = os.path.join(file_dir, rootpath)
for item in file_list:
files.extract(item, file_dir)
elif _is_a_single_dir(file_list):
rootpath = os.path.splitext(file_list[0])[0].split(os.sep)[-1]
uncompressed_path = os.path.join(file_dir, rootpath)
for item in file_list:
files.extract(item, file_dir)
else:
rootpath = os.path.splitext(filepath)[0].split(os.sep)[-1]
uncompressed_path = os.path.join(file_dir, rootpath)
if not os.path.exists(uncompressed_path):
os.makedirs(uncompressed_path)
for item in file_list:
files.extract(item, os.path.join(file_dir, rootpath))
files.close()
return uncompressed_path
def _is_a_single_file(file_list):
if len(file_list) == 1 and file_list[0].find(os.sep) < 0:
return True
return False
def _is_a_single_dir(file_list):
new_file_list = []
for file_path in file_list:
if '/' in file_path:
file_path = file_path.replace('/', os.sep)
elif '\\' in file_path:
file_path = file_path.replace('\\', os.sep)
new_file_list.append(file_path)
file_name = new_file_list[0].split(os.sep)[0]
for i in range(1, len(new_file_list)):
if file_name != new_file_list[i].split(os.sep)[0]:
return False
return True
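For context, here is a minimal usage sketch of the downloader defined above. The import path is an assumption about where this module lives in the package; the URL and md5 reuse the `ecapatdnn_voxceleb12-16k` entry from the pretrained-models hunk earlier in this diff, and the cache directory is an arbitrary choice:

```python
import os

# Assumed import path; adjust to wherever the module above is actually placed.
from paddlespeech.utils.download import get_path_from_url

url = ("https://paddlespeech.bj.bcebos.com/vector/voxceleb/"
       "sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1.tar.gz")
md5 = "67c7ff8885d5246bd16e0f5ac1cba99f"
root_dir = os.path.expanduser("~/.paddlespeech/models")  # illustrative cache dir

# Downloads if needed, verifies the md5, decompresses the tarball and returns
# the local path; a second call finds the cached copy and returns immediately.
local_path = get_path_from_url(url, root_dir, md5sum=md5, decompress=True)
print(local_path)
```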
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
......@@ -16,6 +16,7 @@ import json
import os
import re
import numpy as np
import paddle
import soundfile
import websocket
......@@ -44,11 +45,10 @@ class ACSEngine(BaseEngine):
logger.info("Init the acs engine")
try:
self.config = config
if self.config.device:
self.device = self.config.device
else:
self.device = paddle.get_device()
self.device = self.config.get("device", paddle.get_device())
# websocket default ping timeout is 20 seconds
self.ping_timeout = self.config.get("ping_timeout", 20)
paddle.set_device(self.device)
logger.info(f"ACS Engine set the device: {self.device}")
......@@ -100,8 +100,8 @@ class ACSEngine(BaseEngine):
logger.error("No asr server, please input valid ip and port")
return ""
ws = websocket.WebSocket()
ws.connect(self.url)
# with websocket.WebSocket.connect(self.url) as ws:
logger.info(f"set the ping timeout: {self.ping_timeout} seconds")
ws.connect(self.url, ping_timeout=self.ping_timeout)
audio_info = json.dumps(
{
"name": "test.wav",
......@@ -116,8 +116,8 @@ class ACSEngine(BaseEngine):
logger.info("client receive msg={}".format(msg))
# send the total audio data
samples, sample_rate = soundfile.read(audio_data, dtype='int16')
ws.send_binary(samples.tobytes())
for chunk_data in self.read_wave(audio_data):
ws.send_binary(chunk_data.tobytes())
msg = ws.recv()
msg = json.loads(msg)
logger.info(f"audio result: {msg}")
......@@ -142,6 +142,39 @@ class ACSEngine(BaseEngine):
return msg
def read_wave(self, audio_data: str):
"""read the audio file from specific wavfile path
Args:
audio_data (str): the audio data,
we assume that audio sample rate matches the model
Yields:
numpy.array: the samall package audio pcm data
"""
samples, sample_rate = soundfile.read(audio_data, dtype='int16')
x_len = len(samples)
assert sample_rate == 16000
chunk_size = int(85 * sample_rate / 1000) # 85ms, sample_rate = 16kHz
if x_len % chunk_size != 0:
padding_len_x = chunk_size - x_len % chunk_size
else:
padding_len_x = 0
padding = np.zeros((padding_len_x), dtype=samples.dtype)
padded_x = np.concatenate([samples, padding], axis=0)
assert (x_len + padding_len_x) % chunk_size == 0
num_chunk = (x_len + padding_len_x) / chunk_size
num_chunk = int(num_chunk)
for i in range(0, num_chunk):
start = i * chunk_size
end = start + chunk_size
x_chunk = padded_x[start:end]
yield x_chunk
def get_macthed_word(self, msg):
"""Get the matched info in msg
......
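A quick back-of-the-envelope check of the chunking that `read_wave` introduces (illustrative only; `test.wav` stands in for any 16 kHz mono wav file):

```python
# 85 ms at 16 kHz is 1360 samples, i.e. 2720 bytes of int16 PCM per
# ws.send_binary() call, instead of one send of the whole waveform.
import numpy as np
import soundfile

sample_rate = 16000
chunk_size = int(85 * sample_rate / 1000)
assert chunk_size == 1360

samples, sr = soundfile.read("test.wav", dtype="int16")  # placeholder path
assert sr == sample_rate
padding_len = (-len(samples)) % chunk_size  # zero-pad the tail, as above
padded = np.concatenate([samples, np.zeros(padding_len, dtype=samples.dtype)])
for start in range(0, len(padded), chunk_size):
    chunk = padded[start:start + chunk_size]  # what the engine sends per step
```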
......@@ -25,7 +25,6 @@ from yacs.config import CfgNode
from .pretrained_models import pretrained_models
from paddlespeech.cli.log import logger
from paddlespeech.cli.tts.infer import TTSExecutor
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.server.engine.base_engine import BaseEngine
from paddlespeech.server.utils.audio_process import float2pcm
from paddlespeech.server.utils.util import denorm
......@@ -33,6 +32,7 @@ from paddlespeech.server.utils.util import get_chunks
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ['TTSEngine']
......
......@@ -17,12 +17,12 @@ from typing import List
from fastapi import APIRouter
from paddlespeech.cli.log import logger
from paddlespeech.server.restful.acs_api import router as acs_router
from paddlespeech.server.restful.asr_api import router as asr_router
from paddlespeech.server.restful.cls_api import router as cls_router
from paddlespeech.server.restful.text_api import router as text_router
from paddlespeech.server.restful.tts_api import router as tts_router
from paddlespeech.server.restful.vector_api import router as vec_router
from paddlespeech.server.restful.acs_api import router as acs_router
_router = APIRouter()
......
......@@ -29,9 +29,9 @@ import requests
import yaml
from paddle.framework import load
from . import download
from .entry import client_commands
from .entry import server_commands
from paddlespeech.cli import download
try:
from .. import __version__
except ImportError:
......
......@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
class Frame(object):
"""Represents a "frame" of audio data."""
......@@ -77,8 +78,8 @@ class ChunkBuffer(object):
offset = 0
while offset + self.window_bytes <= len(audio):
yield Frame(audio[offset:offset + self.window_bytes], self.timestamp,
self.window_sec)
yield Frame(audio[offset:offset + self.window_bytes],
self.timestamp, self.window_sec)
self.timestamp += self.shift_sec
offset += self.shift_bytes
......
......@@ -293,3 +293,45 @@ def transformer_single_spk_batch_fn(examples):
"speech_lengths": speech_lengths,
}
return batch
def vits_single_spk_batch_fn(examples):
"""
Returns:
Dict[str, Any]:
- text (Tensor): Text index tensor (B, T_text).
- text_lengths (Tensor): Text length tensor (B,).
- feats (Tensor): Feature tensor (B, T_feats, aux_channels).
- feats_lengths (Tensor): Feature length tensor (B,).
- speech (Tensor): Speech waveform tensor (B, T_wav).
"""
# fields = ["text", "text_lengths", "feats", "feats_lengths", "speech"]
text = [np.array(item["text"], dtype=np.int64) for item in examples]
feats = [np.array(item["feats"], dtype=np.float32) for item in examples]
speech = [np.array(item["wave"], dtype=np.float32) for item in examples]
text_lengths = [
np.array(item["text_lengths"], dtype=np.int64) for item in examples
]
feats_lengths = [
np.array(item["feats_lengths"], dtype=np.int64) for item in examples
]
text = batch_sequences(text)
feats = batch_sequences(feats)
speech = batch_sequences(speech)
# convert each batch to paddle.Tensor
text = paddle.to_tensor(text)
feats = paddle.to_tensor(feats)
text_lengths = paddle.to_tensor(text_lengths)
feats_lengths = paddle.to_tensor(feats_lengths)
speech = paddle.to_tensor(speech)
batch = {
"text": text,
"text_lengths": text_lengths,
"feats": feats,
"feats_lengths": feats_lengths,
"speech": speech
}
return batch
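For context, a minimal sketch of how a collate function such as `vits_single_spk_batch_fn` is typically wired into a `paddle.io.DataLoader`. The toy dataset below only mirrors the keys read above ("text", "text_lengths", "feats", "feats_lengths", "wave"); it is not part of the patch and it assumes the collate function (and its `batch_sequences` helper) is in scope:

```python
import numpy as np
import paddle
from paddle.io import DataLoader, Dataset


class ToyVITSDataset(Dataset):
    """Random examples shaped like the metadata the collate function expects."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        t, f = 5 + idx, 20 + idx  # text length, number of feature frames
        return {
            "text": np.random.randint(0, 100, (t, )),
            "text_lengths": t,
            "feats": np.random.randn(f, 80).astype(np.float32),
            "feats_lengths": f,
            "wave": np.random.randn(f * 256).astype(np.float32),
        }


loader = DataLoader(
    ToyVITSDataset(),
    batch_size=4,
    collate_fn=vits_single_spk_batch_fn)  # the function defined above
batch = next(iter(loader))
print(batch["text"].shape, batch["feats"].shape, batch["text_lengths"])
```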
......@@ -167,7 +167,6 @@ def batch_spec(minibatch, pad_value=0., time_major=False, dtype=np.float32):
def batch_sequences(sequences, axis=0, pad_value=0):
# import pdb; pdb.set_trace()
seq = sequences[0]
ndim = seq.ndim
if axis < 0:
......
......@@ -20,15 +20,14 @@ from scipy.interpolate import interp1d
class LogMelFBank():
def __init__(self,
sr=24000,
n_fft=2048,
hop_length=300,
win_length=None,
window="hann",
n_mels=80,
fmin=80,
fmax=7600,
eps=1e-10):
sr: int=24000,
n_fft: int=2048,
hop_length: int=300,
win_length: int=None,
window: str="hann",
n_mels: int=80,
fmin: int=80,
fmax: int=7600):
self.sr = sr
# stft
self.n_fft = n_fft
......@@ -54,7 +53,7 @@ class LogMelFBank():
fmax=self.fmax)
return mel_filter
def _stft(self, wav):
def _stft(self, wav: np.ndarray):
D = librosa.core.stft(
wav,
n_fft=self.n_fft,
......@@ -65,11 +64,11 @@ class LogMelFBank():
pad_mode=self.pad_mode)
return D
def _spectrogram(self, wav):
def _spectrogram(self, wav: np.ndarray):
D = self._stft(wav)
return np.abs(D)
def _mel_spectrogram(self, wav):
def _mel_spectrogram(self, wav: np.ndarray):
S = self._spectrogram(wav)
mel = np.dot(self.mel_filter, S)
return mel
......@@ -90,14 +89,18 @@ class LogMelFBank():
class Pitch():
def __init__(self, sr=24000, hop_length=300, f0min=80, f0max=7600):
def __init__(self,
sr: int=24000,
hop_length: int=300,
f0min: int=80,
f0max: int=7600):
self.sr = sr
self.hop_length = hop_length
self.f0min = f0min
self.f0max = f0max
def _convert_to_continuous_f0(self, f0: np.array) -> np.array:
def _convert_to_continuous_f0(self, f0: np.ndarray) -> np.ndarray:
if (f0 == 0).all():
print("All frames seems to be unvoiced.")
return f0
......@@ -120,9 +123,9 @@ class Pitch():
return f0
def _calculate_f0(self,
input: np.array,
use_continuous_f0=True,
use_log_f0=True) -> np.array:
input: np.ndarray,
use_continuous_f0: bool=True,
use_log_f0: bool=True) -> np.ndarray:
input = input.astype(np.float64)
frame_period = 1000 * self.hop_length / self.sr
f0, timeaxis = pyworld.dio(
......@@ -139,7 +142,8 @@ class Pitch():
f0[nonzero_idxs] = np.log(f0[nonzero_idxs])
return f0.reshape(-1)
def _average_by_duration(self, input: np.array, d: np.array) -> np.array:
def _average_by_duration(self, input: np.ndarray,
d: np.ndarray) -> np.ndarray:
d_cumsum = np.pad(d.cumsum(0), (1, 0), 'constant')
arr_list = []
for start, end in zip(d_cumsum[:-1], d_cumsum[1:]):
......@@ -154,11 +158,11 @@ class Pitch():
return arr_list
def get_pitch(self,
wav,
use_continuous_f0=True,
use_log_f0=True,
use_token_averaged_f0=True,
duration=None):
wav: np.ndarray,
use_continuous_f0: bool=True,
use_log_f0: bool=True,
use_token_averaged_f0: bool=True,
duration: np.ndarray=None):
f0 = self._calculate_f0(wav, use_continuous_f0, use_log_f0)
if use_token_averaged_f0 and duration is not None:
f0 = self._average_by_duration(f0, duration)
......@@ -167,15 +171,13 @@ class Pitch():
class Energy():
def __init__(self,
sr=24000,
n_fft=2048,
hop_length=300,
win_length=None,
window="hann",
center=True,
pad_mode="reflect"):
n_fft: int=2048,
hop_length: int=300,
win_length: int=None,
window: str="hann",
center: bool=True,
pad_mode: str="reflect"):
self.sr = sr
self.n_fft = n_fft
self.win_length = win_length
self.hop_length = hop_length
......@@ -183,7 +185,7 @@ class Energy():
self.center = center
self.pad_mode = pad_mode
def _stft(self, wav):
def _stft(self, wav: np.ndarray):
D = librosa.core.stft(
wav,
n_fft=self.n_fft,
......@@ -194,7 +196,7 @@ class Energy():
pad_mode=self.pad_mode)
return D
def _calculate_energy(self, input):
def _calculate_energy(self, input: np.ndarray):
input = input.astype(np.float32)
input_stft = self._stft(input)
input_power = np.abs(input_stft)**2
......@@ -203,7 +205,8 @@ class Energy():
np.sum(input_power, axis=0), a_min=1.0e-10, a_max=float('inf')))
return energy
def _average_by_duration(self, input: np.array, d: np.array) -> np.array:
def _average_by_duration(self, input: np.ndarray,
d: np.ndarray) -> np.ndarray:
d_cumsum = np.pad(d.cumsum(0), (1, 0), 'constant')
arr_list = []
for start, end in zip(d_cumsum[:-1], d_cumsum[1:]):
......@@ -214,8 +217,49 @@ class Energy():
arr_list = np.expand_dims(np.array(arr_list), 0).T
return arr_list
def get_energy(self, wav, use_token_averaged_energy=True, duration=None):
def get_energy(self,
wav: np.ndarray,
use_token_averaged_energy: bool=True,
duration: np.ndarray=None):
energy = self._calculate_energy(wav)
if use_token_averaged_energy and duration is not None:
energy = self._average_by_duration(energy, duration)
return energy
class LinearSpectrogram():
def __init__(
self,
n_fft: int=1024,
win_length: int=None,
hop_length: int=256,
window: str="hann",
center: bool=True, ):
self.n_fft = n_fft
self.hop_length = hop_length
self.win_length = win_length
self.window = window
self.center = center
self.n_fft = n_fft
self.pad_mode = "reflect"
def _stft(self, wav: np.ndarray):
D = librosa.core.stft(
wav,
n_fft=self.n_fft,
hop_length=self.hop_length,
win_length=self.win_length,
window=self.window,
center=self.center,
pad_mode=self.pad_mode)
return D
def _spectrogram(self, wav: np.ndarray):
D = self._stft(wav)
return np.abs(D)
def get_linear_spectrogram(self, wav: np.ndarray):
linear_spectrogram = self._spectrogram(wav)
linear_spectrogram = np.clip(
linear_spectrogram, a_min=1e-10, a_max=float("inf"))
return linear_spectrogram.T
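A quick sanity check of the new `LinearSpectrogram` extractor (illustrative; it assumes the class above is in scope alongside `LogMelFBank`, `Pitch` and `Energy`):

```python
import numpy as np

sr = 24000
wav = np.random.randn(sr).astype(np.float32)  # one second of noise

extractor = LinearSpectrogram(n_fft=1024, hop_length=256, window="hann")
spec = extractor.get_linear_spectrogram(wav)

# center=True STFT: 1 + len(wav) // hop_length frames, n_fft // 2 + 1 bins,
# transposed by get_linear_spectrogram() -> (94, 513) for this input.
print(spec.shape)
```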
......@@ -147,10 +147,17 @@ def process_sentences(config,
spk_emb_dir: Path=None):
if nprocs == 1:
results = []
for fp in fps:
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor, pitch_extractor,
energy_extractor, cut_sil, spk_emb_dir)
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(
config=config,
fp=fp,
sentences=sentences,
output_dir=output_dir,
mel_extractor=mel_extractor,
pitch_extractor=pitch_extractor,
energy_extractor=energy_extractor,
cut_sil=cut_sil,
spk_emb_dir=spk_emb_dir)
if record:
results.append(record)
else:
......@@ -325,7 +332,6 @@ def main():
f0min=config.f0min,
f0max=config.f0max)
energy_extractor = Energy(
sr=config.fs,
n_fft=config.n_fft,
hop_length=config.n_shift,
win_length=config.win_length,
......@@ -334,36 +340,36 @@ def main():
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
mel_extractor,
pitch_extractor,
energy_extractor,
config=config,
fps=train_wav_files,
sentences=sentences,
output_dir=train_dump_dir,
mel_extractor=mel_extractor,
pitch_extractor=pitch_extractor,
energy_extractor=energy_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
spk_emb_dir=spk_emb_dir)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
mel_extractor,
pitch_extractor,
energy_extractor,
config=config,
fps=dev_wav_files,
sentences=sentences,
output_dir=dev_dump_dir,
mel_extractor=mel_extractor,
pitch_extractor=pitch_extractor,
energy_extractor=energy_extractor,
cut_sil=args.cut_sil,
spk_emb_dir=spk_emb_dir)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
mel_extractor,
pitch_extractor,
energy_extractor,
config=config,
fps=test_wav_files,
sentences=sentences,
output_dir=test_dump_dir,
mel_extractor=mel_extractor,
pitch_extractor=pitch_extractor,
energy_extractor=energy_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
spk_emb_dir=spk_emb_dir)
......
......@@ -88,15 +88,17 @@ def process_sentence(config: Dict[str, Any],
y, (0, num_frames * config.n_shift - y.size), mode="reflect")
else:
y = y[:num_frames * config.n_shift]
num_sample = y.shape[0]
num_samples = y.shape[0]
mel_path = output_dir / (utt_id + "_feats.npy")
wav_path = output_dir / (utt_id + "_wave.npy")
np.save(wav_path, y) # (num_samples, )
np.save(mel_path, logmel) # (num_frames, n_mels)
# (num_samples, )
np.save(wav_path, y)
# (num_frames, n_mels)
np.save(mel_path, logmel)
record = {
"utt_id": utt_id,
"num_samples": num_sample,
"num_samples": num_samples,
"num_frames": num_frames,
"feats": str(mel_path),
"wave": str(wav_path),
......@@ -111,11 +113,17 @@ def process_sentences(config,
mel_extractor=None,
nprocs: int=1,
cut_sil: bool=True):
if nprocs == 1:
results = []
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor, cut_sil)
record = process_sentence(
config=config,
fp=fp,
sentences=sentences,
output_dir=output_dir,
mel_extractor=mel_extractor,
cut_sil=cut_sil)
if record:
results.append(record)
else:
......@@ -150,7 +158,7 @@ def main():
"--dataset",
default="baker",
type=str,
help="name of dataset, should in {baker, ljspeech, vctk} now")
help="name of dataset, should in {baker, aishell3, ljspeech, vctk} now")
parser.add_argument(
"--rootdir", default=None, type=str, help="directory to dataset.")
parser.add_argument(
......@@ -264,28 +272,28 @@ def main():
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
config=config,
fps=train_wav_files,
sentences=sentences,
output_dir=train_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
config=config,
fps=dev_wav_files,
sentences=sentences,
output_dir=dev_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
config=config,
fps=test_wav_files,
sentences=sentences,
output_dir=test_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil)
......
......@@ -126,11 +126,17 @@ def process_sentences(config,
nprocs: int=1,
cut_sil: bool=True,
use_relative_path: bool=False):
if nprocs == 1:
results = []
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor, cut_sil)
record = process_sentence(
config=config,
fp=fp,
sentences=sentences,
output_dir=output_dir,
mel_extractor=mel_extractor,
cut_sil=cut_sil)
if record:
results.append(record)
else:
......@@ -268,30 +274,30 @@ def main():
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
mel_extractor,
config=config,
fps=train_wav_files,
sentences=sentences,
output_dir=train_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
use_relative_path=args.use_relative_path)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
mel_extractor,
config=config,
fps=dev_wav_files,
sentences=sentences,
output_dir=dev_dump_dir,
mel_extractor=mel_extractor,
cut_sil=args.cut_sil,
use_relative_path=args.use_relative_path)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
mel_extractor,
config=config,
fps=test_wav_files,
sentences=sentences,
output_dir=test_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
use_relative_path=args.use_relative_path)
......
......@@ -176,7 +176,10 @@ def main():
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu or xpu.")
parser.add_argument(
"--nxpu", type=int, default=0, help="if nxpu == 0 and ngpu == 0, use cpu.")
"--nxpu",
type=int,
default=0,
help="if nxpu == 0 and ngpu == 0, use cpu.")
args, _ = parser.parse_known_args()
......
......@@ -188,7 +188,10 @@ def main():
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--nxpu", type=int, default=0, help="if nxpu == 0 and ngpu == 0, use cpu.")
"--nxpu",
type=int,
default=0,
help="if nxpu == 0 and ngpu == 0, use cpu.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu or xpu")
......
......@@ -27,11 +27,11 @@ from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
model_alias = {
# acoustic model
......
......@@ -125,7 +125,7 @@ def evaluate(args):
def parse_args():
# parse args and config and redirect to train_sp
# parse args and config
parser = argparse.ArgumentParser(
description="Synthesize with acoustic model & vocoder")
# acoustic model
......@@ -143,7 +143,7 @@ def parse_args():
'--am_config',
type=str,
default=None,
help='Config of acoustic model. Use deault config when it is None.')
help='Config of acoustic model.')
parser.add_argument(
'--am_ckpt',
type=str,
......@@ -182,7 +182,7 @@ def parse_args():
'--voc_config',
type=str,
default=None,
help='Config of voc. Use deault config when it is None.')
help='Config of voc.')
parser.add_argument(
'--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
parser.add_argument(
......
......@@ -159,7 +159,7 @@ def evaluate(args):
def parse_args():
# parse args and config and redirect to train_sp
# parse args and config
parser = argparse.ArgumentParser(
description="Synthesize with acoustic model & vocoder")
# acoustic model
......@@ -177,7 +177,7 @@ def parse_args():
'--am_config',
type=str,
default=None,
help='Config of acoustic model. Use deault config when it is None.')
help='Config of acoustic model.')
parser.add_argument(
'--am_ckpt',
type=str,
......@@ -223,7 +223,7 @@ def parse_args():
'--voc_config',
type=str,
default=None,
help='Config of voc. Use deault config when it is None.')
help='Config of voc.')
parser.add_argument(
'--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
parser.add_argument(
......
......@@ -24,7 +24,6 @@ from paddle.static import InputSpec
from timer import timer
from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.exps.syn_utils import denorm
from paddlespeech.t2s.exps.syn_utils import get_chunks
from paddlespeech.t2s.exps.syn_utils import get_frontend
......@@ -33,6 +32,7 @@ from paddlespeech.t2s.exps.syn_utils import get_voc_inference
from paddlespeech.t2s.exps.syn_utils import model_alias
from paddlespeech.t2s.exps.syn_utils import voc_to_static
from paddlespeech.t2s.utils import str2bool
from paddlespeech.utils.dynamic_import import dynamic_import
def evaluate(args):
......@@ -201,7 +201,7 @@ def evaluate(args):
def parse_args():
# parse args and config and redirect to train_sp
# parse args and config
parser = argparse.ArgumentParser(
description="Synthesize with acoustic model & vocoder")
# acoustic model
......@@ -212,10 +212,7 @@ def parse_args():
choices=['fastspeech2_csmsc'],
help='Choose acoustic model type of tts task.')
parser.add_argument(
'--am_config',
type=str,
default=None,
help='Config of acoustic model. Use deault config when it is None.')
'--am_config', type=str, default=None, help='Config of acoustic model.')
parser.add_argument(
'--am_ckpt',
type=str,
......@@ -245,10 +242,7 @@ def parse_args():
],
help='Choose vocoder type of tts task.')
parser.add_argument(
'--voc_config',
type=str,
default=None,
help='Config of voc. Use deault config when it is None.')
'--voc_config', type=str, default=None, help='Config of voc.')
parser.add_argument(
'--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
parser.add_argument(
......
......@@ -125,9 +125,15 @@ def process_sentences(config,
spk_emb_dir: Path=None):
if nprocs == 1:
results = []
for fp in fps:
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor, cut_sil, spk_emb_dir)
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(
config=config,
fp=fp,
sentences=sentences,
output_dir=output_dir,
mel_extractor=mel_extractor,
cut_sil=cut_sil,
spk_emb_dir=spk_emb_dir)
if record:
results.append(record)
else:
......@@ -299,30 +305,30 @@ def main():
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
mel_extractor,
config=config,
fps=train_wav_files,
sentences=sentences,
output_dir=train_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
spk_emb_dir=spk_emb_dir)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
mel_extractor,
config=config,
fps=dev_wav_files,
sentences=sentences,
output_dir=dev_dump_dir,
mel_extractor=mel_extractor,
cut_sil=args.cut_sil,
spk_emb_dir=spk_emb_dir)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
mel_extractor,
config=config,
fps=test_wav_files,
sentences=sentences,
output_dir=test_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu,
cut_sil=args.cut_sil,
spk_emb_dir=spk_emb_dir)
......
......@@ -125,11 +125,16 @@ def process_sentences(config,
output_dir: Path,
mel_extractor=None,
nprocs: int=1):
if nprocs == 1:
results = []
for fp in tqdm.tqdm(fps, total=len(fps)):
record = process_sentence(config, fp, sentences, output_dir,
mel_extractor)
record = process_sentence(
config=config,
fp=fp,
sentences=sentences,
output_dir=output_dir,
mel_extractor=mel_extractor)
if record:
results.append(record)
else:
......@@ -247,27 +252,27 @@ def main():
# process for the 3 sections
if train_wav_files:
process_sentences(
config,
train_wav_files,
sentences,
train_dump_dir,
mel_extractor,
config=config,
fps=train_wav_files,
sentences=sentences,
output_dir=train_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu)
if dev_wav_files:
process_sentences(
config,
dev_wav_files,
sentences,
dev_dump_dir,
mel_extractor,
config=config,
fps=dev_wav_files,
sentences=sentences,
output_dir=dev_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu)
if test_wav_files:
process_sentences(
config,
test_wav_files,
sentences,
test_dump_dir,
mel_extractor,
config=config,
fps=test_wav_files,
sentences=sentences,
output_dir=test_dump_dir,
mel_extractor=mel_extractor,
nprocs=args.num_cpu)
......
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Normalize feature files and dump them."""
import argparse
import logging
from operator import itemgetter
from pathlib import Path
import jsonlines
import numpy as np
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
from paddlespeech.t2s.datasets.data_table import DataTable
def main():
"""Run preprocessing process."""
parser = argparse.ArgumentParser(
description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)."
)
parser.add_argument(
"--metadata",
type=str,
required=True,
help="directory including feature files to be normalized. "
"you need to specify either *-scp or rootdir.")
parser.add_argument(
"--dumpdir",
type=str,
required=True,
help="directory to dump normalized feature files.")
parser.add_argument(
"--feats-stats",
type=str,
required=True,
help="speech statistics file.")
parser.add_argument(
"--skip-wav-copy",
default=False,
action="store_true",
help="whether to skip the copy of wav files.")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--speaker-dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
"--verbose",
type=int,
default=1,
help="logging level. higher is more logging. (default=1)")
args = parser.parse_args()
# set logger
if args.verbose > 1:
logging.basicConfig(
level=logging.DEBUG,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
elif args.verbose > 0:
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
else:
logging.basicConfig(
level=logging.WARN,
format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
)
logging.warning('Skip DEBUG/INFO messages')
dumpdir = Path(args.dumpdir).expanduser()
# use absolute path
dumpdir = dumpdir.resolve()
dumpdir.mkdir(parents=True, exist_ok=True)
# get dataset
with jsonlines.open(args.metadata, 'r') as reader:
metadata = list(reader)
dataset = DataTable(
metadata,
converters={
"feats": np.load,
"wave": None if args.skip_wav_copy else np.load,
})
logging.info(f"The number of files = {len(dataset)}.")
# restore scaler
feats_scaler = StandardScaler()
feats_scaler.mean_ = np.load(args.feats_stats)[0]
feats_scaler.scale_ = np.load(args.feats_stats)[1]
feats_scaler.n_features_in_ = feats_scaler.mean_.shape[0]
vocab_phones = {}
with open(args.phones_dict, 'rt') as f:
phn_id = [line.strip().split() for line in f.readlines()]
for phn, id in phn_id:
vocab_phones[phn] = int(id)
vocab_speaker = {}
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
for spk, id in spk_id:
vocab_speaker[spk] = int(id)
# process each file
output_metadata = []
for item in tqdm(dataset):
utt_id = item['utt_id']
feats = item['feats']
wave = item['wave']
# normalize
feats = feats_scaler.transform(feats)
feats_path = dumpdir / f"{utt_id}_feats.npy"
np.save(feats_path, feats.astype(np.float32), allow_pickle=False)
if not args.skip_wav_copy:
wav_path = dumpdir / f"{utt_id}_wave.npy"
np.save(wav_path, wave.astype(np.float32), allow_pickle=False)
else:
wav_path = wave
phone_ids = [vocab_phones[p] for p in item['phones']]
spk_id = vocab_speaker[item["speaker"]]
record = {
"utt_id": item['utt_id'],
"text": phone_ids,
"text_lengths": item['text_lengths'],
'feats': str(feats_path),
"feats_lengths": item['feats_lengths'],
"wave": str(wav_path),
"spk_id": spk_id,
}
# add spk_emb for voice cloning
if "spk_emb" in item:
record["spk_emb"] = str(item["spk_emb"])
output_metadata.append(record)
output_metadata.sort(key=itemgetter('utt_id'))
output_metadata_path = Path(args.dumpdir) / "metadata.jsonl"
with jsonlines.open(output_metadata_path, 'w') as writer:
for item in output_metadata:
writer.write(item)
logging.info(f"metadata dumped into {output_metadata_path}")
if __name__ == "__main__":
main()
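For reference, the `--feats-stats` file this script restores into a `StandardScaler` is expected to hold the per-dimension mean in row 0 and the scale (std) in row 1. A hedged sketch of how such a file can be produced from the dumped raw features (the feature paths here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
for feats_path in ["dump/train/raw/utt1_feats.npy"]:  # hypothetical paths
    scaler.partial_fit(np.load(feats_path))  # feats shape: (num_frames, feat_dim)

# Row 0: mean_, row 1: scale_ -- exactly what normalize.py loads back above.
stats = np.stack([scaler.mean_, scaler.scale_], axis=0)
np.save("dump/train/feats_stats.npy", stats)
```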
(The remaining 25 file diffs are collapsed in the web view and not shown here.)