Merge branch 'develop' of github.com:PaddlePaddle/PaddleSpeech into add_onnx

c765fca6 · 小湉湉 · 21c75684 · 1843bed4 · c765fca6 · c765fca6
20 changed files
--- a/README.md
+++ b/README.md
@@ -180,7 +180,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
 2021.12.14: We plan to hold online courses introducing the basics and research of speech, along with hands-on coding practice with `paddlespeech`. Please check our [Calendar](https://www.paddlepaddle.org.cn/live) for details.
 --->
 - 👏🏻  2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech.
- 👏🏻  2022.03.28: PaddleSpeech CLI is available for Speaker Verfication.
+- 👏🏻  2022.03.28: PaddleSpeech CLI is available for Speaker Verification.
 - 🤗  2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
 - 👏🏻  2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech.
@@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
 For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
+<a name="ModelList"></a>
 ## Model List
 PaddleSpeech supports many of the most popular models. They are summarized in [released models](./docs/source/released_model.md), together with the available pretrained models.
+<a name="SpeechToText"></a>
 **Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:
 <table style="width:100%">
@@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
+<a name="TextToSpeech"></a>
 **Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. The Acoustic Model and Vocoder models are listed as follows:
 <table>
@@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
+<a name="AudioClassification"></a>
 **Audio Classification**
 <table style="width:100%">
@@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
+<a name="SpeakerVerification"></a>
 **Speaker Verification**
 <table style="width:100%">
@@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
+<a name="PunctuationRestoration"></a>
 **Punctuation Restoration**
 <table style="width:100%">
@@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
    - [Advanced Usage](./docs/source/tts/advanced_usage.md)
    - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
    - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
+  - Speaker Verification
+    - [Audio Searching](./demos/audio_searching/README.md)
+    - [Speaker Verification](./demos/speaker_verification/README.md)
  - [Audio Classification](./demos/audio_tagging/README.md)
-  - [Speaker Verification](./demos/speaker_verification/README.md)
  - [Speech Translation](./demos/speech_translation/README.md)
+  - [Speech Server](./demos/speech_server/README.md)
 - [Released Models](./docs/source/released_model.md)
+  - [Speech-to-Text](#SpeechToText)
+  - [Text-to-Speech](#TextToSpeech)
+  - [Audio Classification](#AudioClassification)
+  - [Speaker Verification](#SpeakerVerification)
+  - [Punctuation Restoration](#PunctuationRestoration)
 - [Community](#Community)
 - [Welcome to contribute](#contribution)
 - [License](#License)

--- a/README_cn.md
+++ b/README_cn.md
@@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
 ## 模型列表
 PaddleSpeech supports many mainstream models and provides pretrained models; for details, see the [model list](./docs/source/released_model.md).
+<a name="语音识别模型"></a>
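 **Speech-to-Text** in PaddleSpeech contains the speech recognition acoustic model, the speech recognition language model, and speech translation, with the following details: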
 <table style="width:100%">
@@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
 </table>
 <a name="语音合成模型"></a>
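 **Text-to-Speech** in PaddleSpeech mainly contains three modules: text frontend, acoustic model, and vocoder. The acoustic models and vocoders are listed as follows: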
 <table>
@@ -488,6 +491,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
 </table>
+<a name="声纹识别模型"></a>
 **Speaker Verification**
 <table style="width:100%">
@@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
  </tbody>
 </table>
+<a name="标点恢复模型"></a>
 **Punctuation Restoration**
 <table style="width:100%">
@@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
    - [Advanced Usage](./docs/source/tts/advanced_usage.md)
    - [Chinese Text Frontend](./docs/source/tts/zh_text_frontend.md)
    - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
+  - Speaker Verification
+    - [Speaker Verification](./demos/speaker_verification/README_cn.md)
+    - [Audio Searching](./demos/audio_searching/README_cn.md)
  - [Audio Classification](./demos/audio_tagging/README_cn.md)
-  - [Speaker Verification](./demos/speaker_verification/README_cn.md)
  - [Speech Translation](./demos/speech_translation/README_cn.md)
+  - [Speech Server](./demos/speech_server/README_cn.md)
 - [Model List](#模型列表)
  - [Speech Recognition](#语音识别模型)
  - [Text-to-Speech](#语音合成模型)
  - [Audio Classification](#声音分类模型)
+  - [Speaker Verification](#声纹识别模型)
+  - [Punctuation Restoration](#标点恢复模型)
 - [Technical Discussion Group](#技术交流群)
 - [Welcome to contribute](#欢迎贡献)
 - [License](#License)

--- a/demos/speaker_verification/README.md
+++ b/demos/speaker_verification/README.md
@@ -38,6 +38,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  ```
  Arguments:
  - `input` (required): Audio file to recognize.
+  - `task` (required): The specific `vector` task to run. Default: `spk`.
  - `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
  - `sample_rate`: Sample rate of the model. Default: `16000`.
  - `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
@@ -47,45 +48,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  Output:
  ```bash
-    demo [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+    demo [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
-      -3.2701402  -11.508579  ]
+    -3.7760346  -11.118123  ]
  ```
 - Python API
@@ -97,56 +98,57 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  audio_emb = vector_executor(
      model='ecapatdnn_voxceleb12',
      sample_rate=16000,
-      config=None, 
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./85236145389.wav',
-      force_yes=False,
      device=paddle.get_device())
  print('Audio embedding Result: \n{}'.format(audio_emb))
  ```
  Output:
  ```bash
  # Vector Result:
-   [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+   Audio embedding Result:
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      -3.2701402  -11.508579  ]
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
+    -3.7760346  -11.118123  ]
  ```
 ### 4.Pretrained Models
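
Note on using the embedding shown in the demo output: a speaker embedding like this is usually compared against another utterance's embedding to decide whether both come from the same speaker. The following is a minimal sketch under the assumption that cosine similarity is the score. The helper names `cosine_similarity` and `same_speaker` and the default `threshold` are hypothetical and not part of the `paddlespeech` API, and the scoring used with the released model may differ.

```python
import math


def cosine_similarity(emb_a, emb_b):
    """Return the cosine similarity of two equal-length 1-D vectors."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(x * x for x in emb_a))
    norm_b = math.sqrt(sum(y * y for y in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the pair when the score reaches the threshold.

    `threshold` is an illustrative value only; in practice it would be
    tuned on held-out verification trials.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold
```

The functions accept any sequence of floats, so the array returned by `vector_executor(...)` in the Python API example could be passed in directly for two different audio files.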

--- a/demos/speaker_verification/README_cn.md
+++ b/demos/speaker_verification/README_cn.md
@@ -37,6 +37,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  ```
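  Arguments:
  - `input` (required): Audio file to recognize.
+  - `task` (required): The specific `vector` task to run. Default: `spk`.
  - `model`: Model of the vector task. Default: `ecapatdnn_voxceleb12`.
  - `sample_rate`: Audio sample rate. Default: `16000`.
  - `config`: Config file of the vector task; if unset, the default config of the pretrained model is used. Default: `None`.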
@@ -45,45 +46,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  Output:
  ```bash
-  demo  [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+  demo  [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
-      -3.2701402  -11.508579  ]
+    -3.7760346  -11.118123  ]
  ```
 - Python API
@@ -98,7 +99,6 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./85236145389.wav',
-      force_yes=False,
      device=paddle.get_device())
  print('Audio embedding Result: \n{}'.format(audio_emb))
  ```
@@ -106,45 +106,46 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  Output:
  ```bash
  # Vector Result:
-   [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+   Audio embedding Result:
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      -3.2701402  -11.508579  ]
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
+    -3.7760346  -11.118123  ]
  ```
 ### 4.Pretrained Models

--- a/demos/speaker_verification/run.sh
+++ b/demos/speaker_verification/run.sh
@@ -2,5 +2,5 @@
 wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
-# asr
+# vector
 paddlespeech vector --task spk --input ./85236145389.wav
\ No newline at end of file
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@@ -6,7 +6,7 @@
 ### Speech Recognition Model
 Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link 
 :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----:  | :-----:  | :-----: 
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB  | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) 
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB  | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) 
 [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) 
 [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) 
 [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) 
@@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
 Model Type | Dataset| Example Link | Pretrained Models | Static Models 
 :-------------:| :------------:| :-----: | :-----: | :-----:
-PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | -
+ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
 ## Punctuation Restoration Models
 Model Type | Dataset| Example Link | Pretrained Models

--- a/examples/aishell/asr0/README.md
+++ b/examples/aishell/asr0/README.md
@@ -173,12 +173,7 @@ bash local/data.sh --stage 2  --stop_stage 2
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
 ```
-The performance of the released models are shown below:
+The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
-|         Acoustic Model         |  Training Data  | Token-based |   Size | Descriptions                                       | CER   | WER  | Hours of speech |
-| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
-| Ds2 Online Aishell ASR0 Model  | Aishell Dataset | Char-based  | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | -    | 151 h           |
-| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based  | 306 MB | 2 Conv + 3 bidirectional GRU layers                | 0.064 | -    | 151 h           |
 ## Stage 4: Static graph model Export
 This stage is to transform dygraph to static graph.
 ```bash

--- a/examples/aishell/asr0/RESULTS.md
+++ b/examples/aishell/asr0/RESULTS.md
@@ -4,15 +4,16 @@
 | Model | Number of Params | Release | Config | Test set | Valid Loss | CER | 
 | --- | --- | --- | --- | --- | --- | --- | 
-| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |  
+| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 |
+| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |  
 ## Deepspeech2 Non-Streaming
 | Model | Number of Params | Release | Config | Test set | Valid Loss | CER |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
+| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
-| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
+| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 |
+| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |
--- a/examples/esc50/README.md
+++ b/examples/esc50/README.md
@@ -4,7 +4,7 @@
 对于声音分类任务，传统机器学习的一个常用做法是首先人工提取音频的时域和频域的多种特征并做特征选择、组合、变换等，然后基于SVM或决策树进行分类。而端到端的深度学习则通常利用深度网络如RNN，CNN等直接对声间波形(waveform)或时频特征(time-frequency)进行特征学习(representation learning)和分类预测。
-在IEEE ICASSP 2017 大会上，谷歌开放了一个大规模的音频数据集[Audioset](https://research.google.com/audioset/)。该数据集包含了 632 类的音频类别以及 2,084,320 条人工标记的每段 10 秒长度的声音剪辑片段（来源于YouTube视频）。目前该数据集已经有210万个已标注的视频数据，5800小时的音频数据，经过标记的声音样本的标签类别为527。
+在IEEE ICASSP 2017 大会上，谷歌开放了一个大规模的音频数据集[Audioset](https://research.google.com/audioset/)。该数据集包含了 632 类的音频类别以及 2,084,320 条人工标记的每段 **10 秒**长度的声音剪辑片段（来源于YouTube视频）。目前该数据集已经有 210万 个已标注的视频数据，5800 小时的音频数据，经过标记的声音样本的标签类别为 527。
 `PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf))是基于Audioset数据集训练的声音分类/识别的模型。经过预训练后，模型可以用于提取音频的embbedding。本示例将使用`PANNs`的预训练模型Finetune完成声音分类的任务。
@@ -12,14 +12,14 @@
 ## 模型简介
 PaddleAudio提供了PANNs的CNN14、CNN10和CNN6的预训练模型，可供用户选择使用：
-- CNN14: 该模型主要包含12个卷积层和2个全连接层，模型参数的数量为79.6M，embbedding维度是2048。
+- CNN14: 该模型主要包含12个卷积层和2个全连接层，模型参数的数量为 79.6M，embbedding维度是 2048。
-- CNN10: 该模型主要包含8个卷积层和2个全连接层，模型参数的数量为4.9M，embbedding维度是512。
+- CNN10: 该模型主要包含8个卷积层和2个全连接层，模型参数的数量为 4.9M，embbedding维度是 512。
-- CNN6: 该模型主要包含4个卷积层和2个全连接层，模型参数的数量为4.5M，embbedding维度是512。
+- CNN6: 该模型主要包含4个卷积层和2个全连接层，模型参数的数量为 4.5M，embbedding维度是 512。
 ## 数据集
-[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) 是一个包含有 2000 个带标签的环境声音样本，音频样本采样率为 44,100Hz 的单通道音频文件，所有样本根据标签被划分为 50 个类别，每个类别有 40 个样本。
+[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) 是一个包含有 2000 个带标签的时长为 **5 秒**的环境声音样本，音频样本采样率为 44,100Hz 的单通道音频文件，所有样本根据标签被划分为 50 个类别，每个类别有 40 个样本。
 ## 模型指标
@@ -43,13 +43,13 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns.yaml
 ```
 训练的参数可在 `conf/panns.yaml` 的 `training` 中配置，其中：
-- `epochs`: 训练轮次，默认为50。
+- `epochs`: 训练轮次，默认为 50。
 - `learning_rate`: Fine-tune的学习率；默认为5e-5。
-- `batch_size`: 批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为16。
+- `batch_size`: 批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为 16。
 - `num_workers`: Dataloader获取数据的子进程数。默认为0，加载数据的流程在主进程执行。
 - `checkpoint_dir`: 模型参数文件和optimizer参数文件的保存目录，默认为`./checkpoint`。
-- `save_freq`: 训练过程中的模型保存频率，默认为10。
+- `save_freq`: 训练过程中的模型保存频率，默认为 10。
-- `log_freq`: 训练过程中的信息打印频率，默认为10。
+- `log_freq`: 训练过程中的信息打印频率，默认为 10。
 示例代码中使用的预训练模型为`CNN14`，如果想更换为其他预训练模型，可通过修改 `conf/panns.yaml` 的 `model` 中配置：
 ```yaml
@@ -76,7 +76,7 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 2 conf/panns.yaml
 训练的参数可在 `conf/panns.yaml` 的 `predicting` 中配置，其中：
 - `audio_file`: 指定预测的音频文件。
-- `top_k`: 预测显示的top k标签的得分，默认为1。
+- `top_k`: 预测显示的top k标签的得分，默认为 1。
 - `checkpoint`: 模型参数checkpoint文件。
 输出的预测结果如下：

--- a/examples/iwslt2012/punc0/README.md
+++ b/examples/iwslt2012/punc0/README.md
@@ -21,7 +21,7 @@
 The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).
 ### Test Result
-- Ernie Linear
+- Ernie
    |       |COMMA  |  PERIOD | QUESTION | OVERALL|
    |:-----:|:-----:|:-----:|:-----:|:-----:|  
    |Precision  |0.510955  |0.526462  |0.820755  |0.619391|

--- a/examples/iwslt2012/punc0/RESULTS.md
+++ b/examples/iwslt2012/punc0/RESULTS.md
+# iwslt2012
+## Ernie
+|       |COMMA  |  PERIOD | QUESTION | OVERALL|
+|:-----:|:-----:|:-----:|:-----:|:-----:|  
+|Precision  |0.510955  |0.526462  |0.820755  |0.619391|
+|Recall     |0.517433  |0.564179  |0.861386  |0.647666|
+|F1         |0.514173  |0.544669  |0.840580  |0.633141|
--- a/examples/voxceleb/sv0/RESULT.md
+++ b/examples/voxceleb/sv0/RESULT.md
@@ -4,4 +4,4 @@
 | Model | Number of Params | Release | Config | dim | Test set |  Cosine | Cosine + S-Norm | 
 | --- | --- | --- | --- | --- | --- | --- | ---- |
-| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 |  1.06 | 
+| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 |  0.95 | 
--- a/paddleaudio/paddleaudio/metric/__init__.py
+++ b/paddleaudio/paddleaudio/metric/__init__.py
@@ -14,4 +14,3 @@
 from .dtw import dtw_distance
 from .eer import compute_eer
 from .eer import compute_minDCF
-from .mcd import mcd_distance
--- a/paddleaudio/paddleaudio/metric/mcd.py
+++ b/paddleaudio/paddleaudio/metric/mcd.py
@@ -11,53 +11,20 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from typing import Callable
-import mcd.metrics_fast as mt
 import numpy as np
-from mcd import dtw
-__all__ = [
-    'mcd_distance',
-]
-def mcd_distance(xs: np.ndarray,
-                 ys: np.ndarray,
-                 cost_fn: Callable=mt.logSpecDbDist) -> float:
-    """Mel cepstral distortion (MCD), dtw distance.
-    Dynamic Time Warping.
-    Uses dynamic programming to compute:
-    Examples:
-        .. code-block:: python
-            wps[i, j] = cost_fn(xs[i], ys[j]) + min(
+def pcm16to32(audio: np.ndarray) -> np.ndarray:
-                            wps[i-1, j  ],  // vertical   / insertion / expansion
+    """pcm int16 to float32
-                            wps[i  , j-1],  // horizontal / deletion  / compression
-                            wps[i-1, j-1])  // diagonal   / match
-            dtw = sqrt(wps[-1, -1])
-    Cost Function:
-    Examples:
-        .. code-block:: python
-            logSpecDbConst = 10.0 / math.log(10.0) * math.sqrt(2.0)
-            def logSpecDbDist(x, y):
-                diff = x - y
-                return logSpecDbConst * math.sqrt(np.inner(diff, diff))
    Args:
-        xs (np.ndarray): ref sequence, [T,D]
+        audio (np.ndarray): Waveform with dtype of int16.
-        ys (np.ndarray): hyp sequence, [T,D]
-        cost_fn (Callable, optional): Cost function. Defaults to mt.logSpecDbDist.
    Returns:
-        float: dtw distance
+        np.ndarray: Waveform with dtype of float32.
    """
+    if audio.dtype == np.int16:
-    min_cost, path = dtw.dtw(xs, ys, cost_fn)
+        audio = audio.astype("float32")
-    return min_cost
+        bits = np.iinfo(np.int16).bits
+        audio = audio / (2**(bits - 1))
+    return audio
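Read on their own, the `+` lines of this hunk define a small helper that scales 16-bit PCM samples into the float range `[-1.0, 1.0)`. A minimal standalone sketch of that reading follows. The docstring is lightly reworded, and the sample values in the usage lines are made up for illustration.

```python
import numpy as np


def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """Convert an int16 PCM waveform to float32 in [-1.0, 1.0)."""
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits  # 16 for int16
        audio = audio / (2**(bits - 1))  # divide by 32768
    return audio


# Illustrative samples (not from the source).
samples = np.array([-32768, 0, 16384], dtype=np.int16)
converted = pcm16to32(samples)
print(converted.dtype, converted)
```

Input that is not `int16` is returned unchanged, so calling the helper on an array that is already float is a no-op.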
--- a/paddleaudio/setup.py
+++ b/paddleaudio/setup.py
@@ -19,7 +19,7 @@ from setuptools.command.install import install
 from setuptools.command.test import test
 # set the version here
-VERSION = '0.2.0'
+VERSION = '0.2.1'
 # Inspired by the example at https://pytest.org/latest/goodpractises.html
@@ -83,9 +83,8 @@ setuptools.setup(
    python_requires='>=3.6',
    install_requires=[
        'numpy >= 1.15.0', 'scipy >= 1.0.0', 'resampy >= 0.2.2',
-        'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'mcd >= 0.4',
+        'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'pathos'
-        'pathos'
+        ],
-    ],
    extras_require={
        'test': [
            'nose', 'librosa==0.8.1', 'soundfile==0.10.3.post1',

--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@@ -80,9 +80,9 @@ pretrained_models = {
    },
    "deepspeech2online_aishell-zh-16k": {
        'url':
-        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
+        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
        'md5':
-        'd5e076217cf60486519f72c217d21b9b',
+        '23e16c69730a1cb5d735c98c83c21e16',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
@@ -426,6 +426,11 @@ class ASRExecutor(BaseExecutor):
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
+            audio_duration = audio.shape[0] / audio_sample_rate
+            max_duration = 50.0
+            if audio_duration >= max_duration:
+                logger.error("Please input audio file less than 50 seconds.\n")
+                return
        except Exception as e:
            logger.exception(e)
            logger.error(

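The duration check added in this hunk can be isolated as a small predicate. The sketch below is an illustration only: the name `is_within_duration` and its signature are hypothetical and do not appear in `infer.py`. It mirrors the hunk's arithmetic (frame count divided by sample rate) and its `>=` comparison, so a clip of exactly 50 seconds is rejected.

```python
def is_within_duration(num_samples: int,
                       sample_rate: int,
                       max_duration: float = 50.0) -> bool:
    """Return True if the clip is strictly shorter than max_duration seconds.

    Hypothetical helper; mirrors `audio.shape[0] / audio_sample_rate`
    and the `audio_duration >= max_duration` rejection in the hunk.
    """
    duration = num_samples / sample_rate
    return duration < max_duration


# 49 s and 50 s of 16 kHz audio, expressed as frame counts.
print(is_within_duration(16000 * 49, 16000))
print(is_within_duration(16000 * 50, 16000))
```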
--- a/paddlespeech/cli/vector/infer.py
+++ b/paddlespeech/cli/vector/infer.py
@@ -42,9 +42,9 @@ pretrained_models = {
    # "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav"
    "ecapatdnn_voxceleb12-16k": {
        'url':
-        'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz',
+        'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz',
        'md5':
-        'a1c0dba7d4de997187786ff517d5b4ec',
+        'cc33023c54ab346cd318408f43fcaf95',
        'cfg_path':
        'conf/model.yaml',  # the yaml config path
        'ckpt_path':

--- a/paddlespeech/t2s/exps/fastspeech2/preprocess.py
+++ b/paddlespeech/t2s/exps/fastspeech2/preprocess.py
@@ -86,6 +86,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
+        # utt_id may be popped in compare_duration_and_mel_length
+        if utt_id not in sentences:
+            return None
        phones = sentences[utt_id][0]
        durations = sentences[utt_id][1]
        num_frames = logmel.shape[0]

--- a/paddlespeech/t2s/exps/speedyspeech/preprocess.py
+++ b/paddlespeech/t2s/exps/speedyspeech/preprocess.py
@@ -79,6 +79,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
+        # utt_id may be popped in compare_duration_and_mel_length
+        if utt_id not in sentences:
+            return None
        labels = sentences[utt_id][0]
        # extract phone and duration
        phones = []

--- a/paddlespeech/t2s/exps/tacotron2/preprocess.py
+++ b/paddlespeech/t2s/exps/tacotron2/preprocess.py
@@ -82,6 +82,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
+        # utt_id may be popped in compare_duration_and_mel_length
+        if utt_id not in sentences:
+            return None
        phones = sentences[utt_id][0]
        durations = sentences[utt_id][1]
        num_frames = logmel.shape[0]
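The three `preprocess.py` hunks above add the same guard: if the utterance was removed from `sentences` during the duration check, `process_sentence` returns `None` instead of indexing a missing key. A minimal sketch of that pattern follows. `drop_if_mismatched` is a hypothetical stand-in for `compare_duration_and_mel_length`, whose real logic is not shown in this diff, and the simplified `process_sentence` signature is likewise an assumption.

```python
from typing import Dict, List, Optional, Tuple

Sentence = Tuple[List[str], List[int]]  # (phones, durations)


def drop_if_mismatched(sentences: Dict[str, Sentence], utt_id: str,
                       num_frames: int) -> None:
    """Hypothetical stand-in: pop the utterance when durations disagree."""
    if sum(sentences[utt_id][1]) != num_frames:
        sentences.pop(utt_id)


def process_sentence(sentences: Dict[str, Sentence], utt_id: str,
                     num_frames: int) -> Optional[dict]:
    drop_if_mismatched(sentences, utt_id, num_frames)
    # utt_id may have been popped above; skip it instead of raising KeyError.
    if utt_id not in sentences:
        return None
    phones, durations = sentences[utt_id]
    return {"utt_id": utt_id, "phones": phones, "durations": durations}


# Illustrative data (not from the source).
demo = {"utt_a": (["p1", "p2"], [3, 4]), "utt_b": (["p1"], [5])}
print(process_sentence(demo, "utt_a", 7))
print(process_sentence(demo, "utt_b", 9))
```

Callers that collect results then need to filter out the `None` entries.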