diff --git a/README.md b/README.md
index a90498293e6dd2b01aa9649105f1f2b075a3bf36..5093dbd678a895026a5021c655f944adc202054f 100644
--- a/README.md
+++ b/README.md
@@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
+
+
## Model List
PaddleSpeech supports a series of most popular models. They are summarized in [released models](./docs/source/released_model.md) and attached with available pretrained models.
+
+
**Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:
@@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+
+
**Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follow:
@@ -457,10 +463,10 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
- GE2E + Tactron2 |
+ GE2E + Tacotron2 |
AISHELL-3 |
- ge2e-tactron2-aishell3
+ ge2e-tacotron2-aishell3
|
@@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+
+
**Audio Classification**
@@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+
+
**Speaker Verification**
@@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+
+
**Punctuation Restoration**
@@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
+ - Speaker Verification
+ - [Audio Searching](./demos/audio_searching/README.md)
+ - [Speaker Verification](./demos/speaker_verification/README.md)
- [Audio Classification](./demos/audio_tagging/README.md)
- - [Speaker Verification](./demos/speaker_verification/README.md)
- [Speech Translation](./demos/speech_translation/README.md)
+ - [Speech Server](./demos/speech_server/README.md)
- [Released Models](./docs/source/released_model.md)
+ - [Speech-to-Text](#SpeechToText)
+ - [Text-to-Speech](#TextToSpeech)
+ - [Audio Classification](#AudioClassification)
+ - [Speaker Verification](#SpeakerVerification)
+ - [Punctuation Restoration](#PunctuationRestoration)
- [Community](#Community)
- [Welcome to contribute](#contribution)
- [License](#License)
diff --git a/README_cn.md b/README_cn.md
index ab4ce6e6b878626011ac5cbcfb5c82b4b03ef5d6..5dab7fa0c034fe778c7f7c10e65a78fb6c3e52b5 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
## 模型列表
PaddleSpeech 支持很多主流的模型,并提供了预训练模型,详情请见[模型列表](./docs/source/released_model.md)。
+
+
PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识别语言模型和语音翻译, 详情如下:
@@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
+
PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声学模型和声码器。声学模型和声码器模型如下:
+
+
**声纹识别**
@@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
+
+
**标点恢复**
@@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- [进阶用法](./docs/source/tts/advanced_usage.md)
- [中文文本前端](./docs/source/tts/zh_text_frontend.md)
- [测试语音样本](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
+ - 声纹识别
+ - [声纹识别](./demos/speaker_verification/README_cn.md)
+ - [音频检索](./demos/audio_searching/README_cn.md)
- [声音分类](./demos/audio_tagging/README_cn.md)
- - [声纹识别](./demos/speaker_verification/README_cn.md)
- [语音翻译](./demos/speech_translation/README_cn.md)
+ - [服务化部署](./demos/speech_server/README_cn.md)
- [模型列表](#模型列表)
- [语音识别](#语音识别模型)
- [语音合成](#语音合成模型)
- [声音分类](#声音分类模型)
+ - [声纹识别](#声纹识别模型)
+ - [标点恢复](#标点恢复模型)
- [技术交流群](#技术交流群)
- [欢迎贡献](#欢迎贡献)
- [License](#License)
diff --git a/demos/speaker_verification/README.md b/demos/speaker_verification/README.md
index 8739d402da97a576e5c1349fd01913e3c399911e..7d7180ae9df6ef2c34bd414bfe65ecfc7284fc60 100644
--- a/demos/speaker_verification/README.md
+++ b/demos/speaker_verification/README.md
@@ -30,6 +30,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
+
+ paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
+
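+  # each line of vec.job: <utterance id> <enroll audio> <test audio>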
+ echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
+ paddlespeech vector --task score --input vec.job
```
Usage:
@@ -38,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
Arguments:
- `input`(required): Audio file to recognize.
+  - `task` (required): Task of `vector`, either `spk` or `score`. `spk` extracts a speaker embedding; `score` computes the similarity of two audios. Default: `spk`.
- `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
@@ -47,45 +53,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
- demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
- -3.04878 1.611095 10.127234 -10.534177 -15.821609
- 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
- -11.343508 2.3385992 -8.719341 14.213509 15.404744
- -0.39327756 6.338786 2.688887 8.7104025 17.469526
- -8.77959 7.0576906 4.648855 -1.3089896 -23.294737
- 8.013747 13.891729 -9.926753 5.655307 -5.9422326
- -22.842539 0.6293588 -18.46266 -10.811862 9.8192625
- 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
- 1.7594414 -0.6485091 4.485623 2.0207152 7.264915
- -6.40137 23.63524 2.9711294 -22.708025 9.93719
- 20.354511 -10.324688 -0.700492 -8.783211 -5.27593
- 15.999649 3.3004563 12.747926 15.429879 4.7849145
- 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
- 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
- -9.224193 14.568347 -10.568833 4.982321 -4.342062
- 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
- -6.680575 0.4757669 -5.035051 -6.7964664 16.865469
- -11.54324 7.681869 0.44475392 9.708182 -8.932846
- 0.4123232 -4.361452 1.3948607 9.511665 0.11667654
- 2.9079323 6.049952 9.275183 -18.078873 6.2983274
- -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
- 4.010979 11.000591 -2.8873312 7.1352735 -16.79663
- 18.495346 -14.293832 7.89578 2.2714825 22.976387
- -4.875734 -3.0836344 -2.9999814 13.751918 6.448228
- -11.924197 2.171869 2.0423572 -6.173772 10.778437
- 25.77281 -4.9495463 14.57806 0.3044315 2.6132357
- -7.591999 -2.076944 9.025118 1.7834753 -3.1799617
- -4.9401326 23.465864 5.1685796 -9.018578 9.037825
- -4.4150195 6.859591 -12.274467 -0.88911164 5.186309
- -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
- -12.397416 -12.719869 -1.395601 2.1150916 5.7381287
- -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
- 8.731719 -20.778936 -11.495662 5.8033476 -4.752041
- 10.833007 -6.717991 4.504732 13.4244375 1.1306485
- 7.3435574 1.400918 14.704036 -9.501399 7.2315617
- -6.417456 1.3333273 11.872697 -0.30664724 8.8845
- 6.5569253 4.7948146 0.03662816 -8.704245 6.224871
- -3.2701402 -11.508579 ]
+ demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
+ 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
+ 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
+ -9.723131 0.6619743 -6.976803 10.213478 7.494748
+ 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
+ -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
+ 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
+ -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
+ 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
+ -3.539873 3.814236 5.1420674 2.162061 4.096431
+ -6.4162116 12.747448 1.9429878 -15.152943 6.417416
+ 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
+ 11.567354 3.69788 11.258265 7.442363 9.183411
+ 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
+ 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
+ -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
+ 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
+ -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
+ -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
+ -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
+ 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
+ -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
+ -0.31784213 9.493548 2.1144536 4.358092 -12.089823
+ 8.451689 -7.925461 4.6242585 4.4289427 18.692003
+ -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
+ -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
+ 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
+ -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
+ 0.66607 15.443222 4.740594 -3.4725387 11.592567
+ -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
+ -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
+ -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
+ -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
+ 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
+ 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
+ 4.532936 2.7264361 10.145339 -6.521951 2.897153
+ -3.3925855 5.079156 7.759716 4.677565 5.8457737
+ 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
+ -3.7760346 -11.118123 ]
```
- Python API
@@ -97,56 +103,113 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
audio_emb = vector_executor(
model='ecapatdnn_voxceleb12',
sample_rate=16000,
- config=None,
+ config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./85236145389.wav',
- force_yes=False,
device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))
+
+ test_emb = vector_executor(
+ model='ecapatdnn_voxceleb12',
+ sample_rate=16000,
+ config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
+ ckpt_path=None,
+ audio_file='./123456789.wav',
+ device=paddle.get_device())
+ print('Test embedding Result: \n{}'.format(test_emb))
+
+ # score range [0, 1]
+ score = vector_executor.get_embeddings_score(audio_emb, test_emb)
+ print(f"Eembeddings Score: {score}")
```
- Output:
+ Output:
+
```bash
# Vector Result:
- [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
- -3.04878 1.611095 10.127234 -10.534177 -15.821609
- 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
- -11.343508 2.3385992 -8.719341 14.213509 15.404744
- -0.39327756 6.338786 2.688887 8.7104025 17.469526
- -8.77959 7.0576906 4.648855 -1.3089896 -23.294737
- 8.013747 13.891729 -9.926753 5.655307 -5.9422326
- -22.842539 0.6293588 -18.46266 -10.811862 9.8192625
- 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
- 1.7594414 -0.6485091 4.485623 2.0207152 7.264915
- -6.40137 23.63524 2.9711294 -22.708025 9.93719
- 20.354511 -10.324688 -0.700492 -8.783211 -5.27593
- 15.999649 3.3004563 12.747926 15.429879 4.7849145
- 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
- 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
- -9.224193 14.568347 -10.568833 4.982321 -4.342062
- 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
- -6.680575 0.4757669 -5.035051 -6.7964664 16.865469
- -11.54324 7.681869 0.44475392 9.708182 -8.932846
- 0.4123232 -4.361452 1.3948607 9.511665 0.11667654
- 2.9079323 6.049952 9.275183 -18.078873 6.2983274
- -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
- 4.010979 11.000591 -2.8873312 7.1352735 -16.79663
- 18.495346 -14.293832 7.89578 2.2714825 22.976387
- -4.875734 -3.0836344 -2.9999814 13.751918 6.448228
- -11.924197 2.171869 2.0423572 -6.173772 10.778437
- 25.77281 -4.9495463 14.57806 0.3044315 2.6132357
- -7.591999 -2.076944 9.025118 1.7834753 -3.1799617
- -4.9401326 23.465864 5.1685796 -9.018578 9.037825
- -4.4150195 6.859591 -12.274467 -0.88911164 5.186309
- -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
- -12.397416 -12.719869 -1.395601 2.1150916 5.7381287
- -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
- 8.731719 -20.778936 -11.495662 5.8033476 -4.752041
- 10.833007 -6.717991 4.504732 13.4244375 1.1306485
- 7.3435574 1.400918 14.704036 -9.501399 7.2315617
- -6.417456 1.3333273 11.872697 -0.30664724 8.8845
- 6.5569253 4.7948146 0.03662816 -8.704245 6.224871
- -3.2701402 -11.508579 ]
+ Audio embedding Result:
+ [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
+ 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
+ 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
+ -9.723131 0.6619743 -6.976803 10.213478 7.494748
+ 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
+ -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
+ 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
+ -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
+ 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
+ -3.539873 3.814236 5.1420674 2.162061 4.096431
+ -6.4162116 12.747448 1.9429878 -15.152943 6.417416
+ 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
+ 11.567354 3.69788 11.258265 7.442363 9.183411
+ 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
+ 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
+ -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
+ 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
+ -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
+ -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
+ -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
+ 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
+ -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
+ -0.31784213 9.493548 2.1144536 4.358092 -12.089823
+ 8.451689 -7.925461 4.6242585 4.4289427 18.692003
+ -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
+ -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
+ 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
+ -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
+ 0.66607 15.443222 4.740594 -3.4725387 11.592567
+ -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
+ -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
+ -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
+ -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
+ 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
+ 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
+ 4.532936 2.7264361 10.145339 -6.521951 2.897153
+ -3.3925855 5.079156 7.759716 4.677565 5.8457737
+ 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
+ -3.7760346 -11.118123 ]
+ # get the test embedding
+ Test embedding Result:
+ [ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125
+ 6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845
+ -8.04332 4.344489 2.3200977 -14.306299 5.184692
+ -11.55602 -3.8497238 0.6444722 1.2833948 2.6766639
+ 0.5878921 0.7946299 1.7207596 2.5791872 14.998469
+ -1.3385371 15.031221 -0.8006958 1.99287 -9.52007
+ 2.435466 4.003221 -4.33817 -4.898601 -5.304714
+ -18.033886 10.790787 -12.784645 -5.641755 2.9761686
+ -10.566622 1.4839455 6.152458 -5.7195854 2.8603241
+ 6.112133 8.489869 5.5958056 1.2836679 -1.2293907
+ 0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906
+ 14.905906 -5.025907 0.7866458 -4.2444224 -16.354029
+ 10.521315 0.9604709 -3.3257897 7.144871 -13.592733
+ -8.568869 -1.7953678 0.26313916 10.916714 -6.9374123
+ 1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357
+ -0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422
+ 3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846
+ 1.3420814 4.240421 -2.772944 -2.8451524 16.311104
+ 4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239
+ -1.5708797 1.568961 1.1413603 3.5032008 -0.45251232
+ -6.786333 16.89443 5.3366146 -8.789056 0.6355629
+ 3.2579517 -3.328322 7.5969577 0.66025066 -6.550468
+ -9.148656 2.020372 -0.4615173 1.1965656 -3.8764873
+ 11.6562195 -6.0750933 12.182899 3.2218833 0.81969476
+ 5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166
+ -5.249467 -2.2671914 7.2658715 -13.298164 4.821147
+ -2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838
+ -3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094
+ -3.4903061 2.2408795 5.5010734 -3.970756 11.99696
+ -7.8858757 0.43160373 -5.5059714 4.3426995 16.322706
+ 11.635366 0.72157705 -9.245714 -3.91465 -4.449838
+ -1.5716927 7.713747 -2.2430465 -6.198303 -13.481864
+ 2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571
+ 13.270688 3.448231 -7.0659585 4.5886116 -4.466099
+ -0.296428 -11.463529 -2.6076477 14.110243 -6.9725137
+ -1.9962958 2.7119343 19.391657 0.01961198 14.607133
+ -1.6695905 -4.391516 1.3131028 -6.670972 -5.888604
+ 12.0612335 5.9285784 3.3715196 1.492534 10.723728
+ -0.95514804 -12.085431 ]
+ # get the score between enroll and test
+ Embeddings Score: 0.4292638301849365
```
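+
+  For reference, here is a minimal NumPy sketch of how a similarity score in [0, 1] can be derived from two embeddings. It mirrors the cosine-similarity scoring commonly used for speaker verification and is an illustration only, not necessarily the internal implementation of `get_embeddings_score`:
+
+  ```python
+  import numpy as np
+
+  def cosine_score(enroll_emb, test_emb):
+      # cosine similarity lies in [-1, 1]
+      cos = np.dot(enroll_emb, test_emb) / (
+          np.linalg.norm(enroll_emb) * np.linalg.norm(test_emb))
+      # map it linearly to [0, 1] to match the documented score range
+      return (cos + 1.0) / 2.0
+
+  # e.g. cosine_score(audio_emb, test_emb) for the embeddings above
+  ```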
### 4.Pretrained Models
diff --git a/demos/speaker_verification/README_cn.md b/demos/speaker_verification/README_cn.md
index fe8949b3ca6d9de77e5095d6bc55844133b73f52..db382f298df74c73ef5fcbd5a3fb64fb2fa1c44f 100644
--- a/demos/speaker_verification/README_cn.md
+++ b/demos/speaker_verification/README_cn.md
@@ -29,6 +29,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
+
+ paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
+
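+  # each line of vec.job: <utterance id> <enroll audio> <test audio>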
+ echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
+ paddlespeech vector --task score --input vec.job
```
使用方法:
@@ -37,6 +42,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
参数:
- `input`(必须输入):用于识别的音频文件。
+ - `task` (必须输入): 用于指定 `vector` 处理的具体任务,默认是 `spk`。
- `model`:声纹任务的模型,默认值:`ecapatdnn_voxceleb12`。
- `sample_rate`:音频采样率,默认值:`16000`。
- `config`:声纹任务的参数文件,若不设置则使用预训练模型中的默认配置,默认值:`None`。
@@ -45,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
输出:
```bash
- demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
- -3.04878 1.611095 10.127234 -10.534177 -15.821609
- 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
- -11.343508 2.3385992 -8.719341 14.213509 15.404744
- -0.39327756 6.338786 2.688887 8.7104025 17.469526
- -8.77959 7.0576906 4.648855 -1.3089896 -23.294737
- 8.013747 13.891729 -9.926753 5.655307 -5.9422326
- -22.842539 0.6293588 -18.46266 -10.811862 9.8192625
- 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
- 1.7594414 -0.6485091 4.485623 2.0207152 7.264915
- -6.40137 23.63524 2.9711294 -22.708025 9.93719
- 20.354511 -10.324688 -0.700492 -8.783211 -5.27593
- 15.999649 3.3004563 12.747926 15.429879 4.7849145
- 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
- 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
- -9.224193 14.568347 -10.568833 4.982321 -4.342062
- 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
- -6.680575 0.4757669 -5.035051 -6.7964664 16.865469
- -11.54324 7.681869 0.44475392 9.708182 -8.932846
- 0.4123232 -4.361452 1.3948607 9.511665 0.11667654
- 2.9079323 6.049952 9.275183 -18.078873 6.2983274
- -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
- 4.010979 11.000591 -2.8873312 7.1352735 -16.79663
- 18.495346 -14.293832 7.89578 2.2714825 22.976387
- -4.875734 -3.0836344 -2.9999814 13.751918 6.448228
- -11.924197 2.171869 2.0423572 -6.173772 10.778437
- 25.77281 -4.9495463 14.57806 0.3044315 2.6132357
- -7.591999 -2.076944 9.025118 1.7834753 -3.1799617
- -4.9401326 23.465864 5.1685796 -9.018578 9.037825
- -4.4150195 6.859591 -12.274467 -0.88911164 5.186309
- -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
- -12.397416 -12.719869 -1.395601 2.1150916 5.7381287
- -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
- 8.731719 -20.778936 -11.495662 5.8033476 -4.752041
- 10.833007 -6.717991 4.504732 13.4244375 1.1306485
- 7.3435574 1.400918 14.704036 -9.501399 7.2315617
- -6.417456 1.3333273 11.872697 -0.30664724 8.8845
- 6.5569253 4.7948146 0.03662816 -8.704245 6.224871
- -3.2701402 -11.508579 ]
+ demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
+ 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
+ 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
+ -9.723131 0.6619743 -6.976803 10.213478 7.494748
+ 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
+ -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
+ 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
+ -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
+ 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
+ -3.539873 3.814236 5.1420674 2.162061 4.096431
+ -6.4162116 12.747448 1.9429878 -15.152943 6.417416
+ 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
+ 11.567354 3.69788 11.258265 7.442363 9.183411
+ 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
+ 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
+ -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
+ 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
+ -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
+ -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
+ -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
+ 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
+ -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
+ -0.31784213 9.493548 2.1144536 4.358092 -12.089823
+ 8.451689 -7.925461 4.6242585 4.4289427 18.692003
+ -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
+ -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
+ 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
+ -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
+ 0.66607 15.443222 4.740594 -3.4725387 11.592567
+ -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
+ -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
+ -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
+ -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
+ 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
+ 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
+ 4.532936 2.7264361 10.145339 -6.521951 2.897153
+ -3.3925855 5.079156 7.759716 4.677565 5.8457737
+ 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
+ -3.7760346 -11.118123 ]
```
- Python API
@@ -98,53 +104,109 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./85236145389.wav',
- force_yes=False,
device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))
+
+ test_emb = vector_executor(
+ model='ecapatdnn_voxceleb12',
+ sample_rate=16000,
+ config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
+ ckpt_path=None,
+ audio_file='./123456789.wav',
+ device=paddle.get_device())
+ print('Test embedding Result: \n{}'.format(test_emb))
+
+ # score range [0, 1]
+ score = vector_executor.get_embeddings_score(audio_emb, test_emb)
+ print(f"Eembeddings Score: {score}")
```
输出:
```bash
# Vector Result:
- [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
- -3.04878 1.611095 10.127234 -10.534177 -15.821609
- 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
- -11.343508 2.3385992 -8.719341 14.213509 15.404744
- -0.39327756 6.338786 2.688887 8.7104025 17.469526
- -8.77959 7.0576906 4.648855 -1.3089896 -23.294737
- 8.013747 13.891729 -9.926753 5.655307 -5.9422326
- -22.842539 0.6293588 -18.46266 -10.811862 9.8192625
- 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
- 1.7594414 -0.6485091 4.485623 2.0207152 7.264915
- -6.40137 23.63524 2.9711294 -22.708025 9.93719
- 20.354511 -10.324688 -0.700492 -8.783211 -5.27593
- 15.999649 3.3004563 12.747926 15.429879 4.7849145
- 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
- 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
- -9.224193 14.568347 -10.568833 4.982321 -4.342062
- 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
- -6.680575 0.4757669 -5.035051 -6.7964664 16.865469
- -11.54324 7.681869 0.44475392 9.708182 -8.932846
- 0.4123232 -4.361452 1.3948607 9.511665 0.11667654
- 2.9079323 6.049952 9.275183 -18.078873 6.2983274
- -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
- 4.010979 11.000591 -2.8873312 7.1352735 -16.79663
- 18.495346 -14.293832 7.89578 2.2714825 22.976387
- -4.875734 -3.0836344 -2.9999814 13.751918 6.448228
- -11.924197 2.171869 2.0423572 -6.173772 10.778437
- 25.77281 -4.9495463 14.57806 0.3044315 2.6132357
- -7.591999 -2.076944 9.025118 1.7834753 -3.1799617
- -4.9401326 23.465864 5.1685796 -9.018578 9.037825
- -4.4150195 6.859591 -12.274467 -0.88911164 5.186309
- -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
- -12.397416 -12.719869 -1.395601 2.1150916 5.7381287
- -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
- 8.731719 -20.778936 -11.495662 5.8033476 -4.752041
- 10.833007 -6.717991 4.504732 13.4244375 1.1306485
- 7.3435574 1.400918 14.704036 -9.501399 7.2315617
- -6.417456 1.3333273 11.872697 -0.30664724 8.8845
- 6.5569253 4.7948146 0.03662816 -8.704245 6.224871
- -3.2701402 -11.508579 ]
+ Audio embedding Result:
+ [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
+ 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
+ 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
+ -9.723131 0.6619743 -6.976803 10.213478 7.494748
+ 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
+ -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
+ 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
+ -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
+ 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
+ -3.539873 3.814236 5.1420674 2.162061 4.096431
+ -6.4162116 12.747448 1.9429878 -15.152943 6.417416
+ 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
+ 11.567354 3.69788 11.258265 7.442363 9.183411
+ 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
+ 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
+ -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
+ 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
+ -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
+ -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
+ -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
+ 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
+ -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
+ -0.31784213 9.493548 2.1144536 4.358092 -12.089823
+ 8.451689 -7.925461 4.6242585 4.4289427 18.692003
+ -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
+ -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
+ 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
+ -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
+ 0.66607 15.443222 4.740594 -3.4725387 11.592567
+ -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
+ -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
+ -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
+ -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
+ 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
+ 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
+ 4.532936 2.7264361 10.145339 -6.521951 2.897153
+ -3.3925855 5.079156 7.759716 4.677565 5.8457737
+ 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
+ -3.7760346 -11.118123 ]
+ # get the test embedding
+ Test embedding Result:
+ [ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125
+ 6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845
+ -8.04332 4.344489 2.3200977 -14.306299 5.184692
+ -11.55602 -3.8497238 0.6444722 1.2833948 2.6766639
+ 0.5878921 0.7946299 1.7207596 2.5791872 14.998469
+ -1.3385371 15.031221 -0.8006958 1.99287 -9.52007
+ 2.435466 4.003221 -4.33817 -4.898601 -5.304714
+ -18.033886 10.790787 -12.784645 -5.641755 2.9761686
+ -10.566622 1.4839455 6.152458 -5.7195854 2.8603241
+ 6.112133 8.489869 5.5958056 1.2836679 -1.2293907
+ 0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906
+ 14.905906 -5.025907 0.7866458 -4.2444224 -16.354029
+ 10.521315 0.9604709 -3.3257897 7.144871 -13.592733
+ -8.568869 -1.7953678 0.26313916 10.916714 -6.9374123
+ 1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357
+ -0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422
+ 3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846
+ 1.3420814 4.240421 -2.772944 -2.8451524 16.311104
+ 4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239
+ -1.5708797 1.568961 1.1413603 3.5032008 -0.45251232
+ -6.786333 16.89443 5.3366146 -8.789056 0.6355629
+ 3.2579517 -3.328322 7.5969577 0.66025066 -6.550468
+ -9.148656 2.020372 -0.4615173 1.1965656 -3.8764873
+ 11.6562195 -6.0750933 12.182899 3.2218833 0.81969476
+ 5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166
+ -5.249467 -2.2671914 7.2658715 -13.298164 4.821147
+ -2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838
+ -3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094
+ -3.4903061 2.2408795 5.5010734 -3.970756 11.99696
+ -7.8858757 0.43160373 -5.5059714 4.3426995 16.322706
+ 11.635366 0.72157705 -9.245714 -3.91465 -4.449838
+ -1.5716927 7.713747 -2.2430465 -6.198303 -13.481864
+ 2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571
+ 13.270688 3.448231 -7.0659585 4.5886116 -4.466099
+ -0.296428 -11.463529 -2.6076477 14.110243 -6.9725137
+ -1.9962958 2.7119343 19.391657 0.01961198 14.607133
+ -1.6695905 -4.391516 1.3131028 -6.670972 -5.888604
+ 12.0612335 5.9285784 3.3715196 1.492534 10.723728
+ -0.95514804 -12.085431 ]
+ # get the score between enroll and test
+ Embeddings Score: 0.4292638301849365
```
### 4.预训练模型
diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh
index 856886d333cd30f983576875e809ed2016a51f50..6140f7f38111978d464c58cafd35aa9c4c0a7cb7 100644
--- a/demos/speaker_verification/run.sh
+++ b/demos/speaker_verification/run.sh
@@ -1,6 +1,9 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
-# asr
-paddlespeech vector --task spk --input ./85236145389.wav
\ No newline at end of file
+# vector
+paddlespeech vector --task spk --input ./85236145389.wav
+
+paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
diff --git a/docs/source/released_model.md b/docs/source/released_model.md
index 9a423e03ecf685dc853119be2c69b9219ea1536a..1cbe398956797a1f54f2f210cee7dc04af35e3aa 100644
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@@ -6,7 +6,7 @@
### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
@@ -37,8 +37,8 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati
Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
-SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)|12MB|
-FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)|157MB|
+SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
+FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB|
FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
@@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
-PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | -
+ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models
diff --git a/examples/aishell/asr0/README.md b/examples/aishell/asr0/README.md
index bb45d8df0423ccb3f56a81229b25cf5e459696a8..16489992d93b088862368da45a7bee675246870b 100644
--- a/examples/aishell/asr0/README.md
+++ b/examples/aishell/asr0/README.md
@@ -173,12 +173,7 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
-The performance of the released models are shown below:
-
-| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech |
-| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
-| Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | - | 151 h |
-| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h |
+The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
## Stage 4: Static graph model Export
This stage is to transform dygraph to static graph.
```bash
diff --git a/examples/aishell/asr0/RESULTS.md b/examples/aishell/asr0/RESULTS.md
index 5841a85220da92dd55ffb980e7e24ac398574709..8af3d66d17efa8fa62a276c67618cd45b03fd77d 100644
--- a/examples/aishell/asr0/RESULTS.md
+++ b/examples/aishell/asr0/RESULTS.md
@@ -4,15 +4,16 @@
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
-| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |
+| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 |
+| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |
## Deepspeech2 Non-Streaming
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
-| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
-| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
+| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
+| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
| --- | --- | --- | --- | --- | --- | --- |
-| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 |
+| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |
diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md
index 664ec1ac36349b64475d71b8f5ee4d916b0e47b4..925663ab1aefce9d6db6f4908d011cab4f77db79 100644
--- a/examples/aishell3/vc0/README.md
+++ b/examples/aishell3/vc0/README.md
@@ -118,7 +118,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
```
## Pretrained Model
-[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
+- [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss
diff --git a/examples/aishell3/vc1/README.md b/examples/aishell3/vc1/README.md
index 04b83a5ffae71082da84958e8d20e982ffeb396f..8ab0f9c8cff833fdaaebdc408ef1c5841381e7ce 100644
--- a/examples/aishell3/vc1/README.md
+++ b/examples/aishell3/vc1/README.md
@@ -119,7 +119,7 @@ ref_audio
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
```
## Pretrained Model
-[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
+- [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md
index dad464092d197de0065a3876ccc3605719de75ff..eb30e7c403c30dfeb1d466f558818eabda8dabfb 100644
--- a/examples/aishell3/voc1/README.md
+++ b/examples/aishell3/voc1/README.md
@@ -137,7 +137,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
-Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip).
+Pretrained models can be downloaded here:
+- [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
diff --git a/examples/aishell3/voc5/README.md b/examples/aishell3/voc5/README.md
index ebe2530beec58e87fa55f8ce6a203260858a21f8..c957c4a3aab385cd94adf03fc2cf12afd5bb351e 100644
--- a/examples/aishell3/voc5/README.md
+++ b/examples/aishell3/voc5/README.md
@@ -136,7 +136,8 @@ optional arguments:
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
-The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
diff --git a/examples/csmsc/tts0/README.md b/examples/csmsc/tts0/README.md
index 0129329aebd95010fdc6045be2151f0f6af8ea25..01376bd61e08055b6da9e71b4cfb812b8e35c5c9 100644
--- a/examples/csmsc/tts0/README.md
+++ b/examples/csmsc/tts0/README.md
@@ -212,7 +212,8 @@ optional arguments:
Pretrained Tacotron2 model with no silence in the edge of audios:
- [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)
-The static model can be downloaded here [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip).
+The static model can be downloaded here:
+- [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss
diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md
index 5f31f7b369429a50a1a83185cf14eb687a88861a..4fbe34cbf739314caa468ea1acea5139fc4bc131 100644
--- a/examples/csmsc/tts2/README.md
+++ b/examples/csmsc/tts2/README.md
@@ -221,9 +221,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
```
## Pretrained Model
-Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip).
+Pretrained SpeedySpeech model with no silence in the edge of audios:
+- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)
-The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).
+The static model can be downloaded here:
+- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
+- [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md
index ae8f7af607253861f96e5c59ac23f8e7c0d69c0e..bc672f66f1eea154436323fcef236456c3cd17b5 100644
--- a/examples/csmsc/tts3/README.md
+++ b/examples/csmsc/tts3/README.md
@@ -232,6 +232,9 @@ The static model can be downloaded here:
- [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)
- [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)
+The ONNX model can be downloaded here:
+- [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)
+
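+As a quick smoke test that an exported ONNX model loads and runs, here is a minimal onnxruntime sketch. The model path, input layout, and phone ids below are illustrative assumptions; inspect `session.get_inputs()` for the real signature, and use `local/ort_predict.sh` for the supported inference flow:
+
+```python
+import numpy as np
+import onnxruntime as ort
+
+# path to the unzipped acoustic model (assumed file name)
+session = ort.InferenceSession("fastspeech2_csmsc.onnx")
+print([i.name for i in session.get_inputs()])  # check the real input names
+
+# dummy phone-id sequence; real ids come from dump/phone_id_map.txt
+phone_ids = np.array([1, 2, 3, 4, 5], dtype=np.int64)
+mel = session.run(None, {session.get_inputs()[0].name: phone_ids})[0]
+print(mel.shape)  # acoustic features, to be fed to a vocoder ONNX model
+```
+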
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
diff --git a/examples/csmsc/tts3/local/ort_predict.sh b/examples/csmsc/tts3/local/ort_predict.sh
new file mode 100755
index 0000000000000000000000000000000000000000..3154f6e5abcd9baa213b2fc41c24adc6c6b453ed
--- /dev/null
+++ b/examples/csmsc/tts3/local/ort_predict.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# only default fastspeech2 + hifigan/mb_melgan are supported now!
+
+# synthesize from metadata
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+ python3 ${BIN_DIR}/../ort_predict.py \
+ --inference_dir=${train_output_path}/inference_onnx \
+ --am=fastspeech2_csmsc \
+ --voc=hifigan_csmsc \
+ --test_metadata=dump/test/norm/metadata.jsonl \
+ --output_dir=${train_output_path}/onnx_infer_out \
+ --device=cpu \
+ --cpu_threads=2
+fi
+
+# e2e, synthesize from text
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+ python3 ${BIN_DIR}/../ort_predict_e2e.py \
+ --inference_dir=${train_output_path}/inference_onnx \
+ --am=fastspeech2_csmsc \
+ --voc=hifigan_csmsc \
+ --output_dir=${train_output_path}/onnx_infer_out_e2e \
+ --text=${BIN_DIR}/../csmsc_test.txt \
+ --phones_dict=dump/phone_id_map.txt \
+ --device=cpu \
+ --cpu_threads=2
+fi
diff --git a/examples/csmsc/tts3/local/paddle2onnx.sh b/examples/csmsc/tts3/local/paddle2onnx.sh
new file mode 100755
index 0000000000000000000000000000000000000000..505f3b663622063df9f40a6e7d13a6d8553a5025
--- /dev/null
+++ b/examples/csmsc/tts3/local/paddle2onnx.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+
+train_output_path=$1
+model_dir=$2
+output_dir=$3
+model=$4
+
+enable_dev_version=True
+
+# strip the dataset suffix, e.g. fastspeech2_csmsc -> fastspeech2
+model_name=${model%_*}
+echo model_name: ${model_name}
+
+if [ ${model_name} = 'mb_melgan' ] ;then
+ enable_dev_version=False
+fi
+
+mkdir -p ${train_output_path}/${output_dir}
+
+paddle2onnx \
+ --model_dir ${train_output_path}/${model_dir} \
+ --model_filename ${model}.pdmodel \
+ --params_filename ${model}.pdiparams \
+ --save_file ${train_output_path}/${output_dir}/${model}.onnx \
+ --enable_dev_version ${enable_dev_version}
\ No newline at end of file
diff --git a/examples/csmsc/tts3/run.sh b/examples/csmsc/tts3/run.sh
index e1a149b6524716dbf68c8b898cd8d8e5b22e57f6..b617d53527d16abd536a65226c93e7ae24592bc3 100755
--- a/examples/csmsc/tts3/run.sh
+++ b/examples/csmsc/tts3/run.sh
@@ -41,3 +41,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
+# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
+# we have only tested the following models so far
+if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
+ # install paddle2onnx
+ version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
+ if [[ -z "$version" || ${version} != '0.9.4' ]]; then
+ pip install paddle2onnx==0.9.4
+ fi
+ ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
+ ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
+ ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc
+fi
+
+# inference with onnxruntime, use fastspeech2 + hifigan by default
+if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
+ # install onnxruntime
+ version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
+ if [[ -z "$version" || ${version} != '1.10.0' ]]; then
+ pip install onnxruntime==1.10.0
+ fi
+ ./local/ort_predict.sh ${train_output_path}
+fi
diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md
index 5527e80888c12456c2e3dbd13cad8e79a6e93da4..2d6de168a18166bb12fff7c808607644d005ad96 100644
--- a/examples/csmsc/voc1/README.md
+++ b/examples/csmsc/voc1/README.md
@@ -127,9 +127,11 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
-The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip).
+The pretrained model can be downloaded here:
+- [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)
-The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).
+The static model can be downloaded here:
+- [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
diff --git a/examples/csmsc/voc3/README.md b/examples/csmsc/voc3/README.md
index 22104a8f215f2c1eca29889778b98ac08575e193..12adaf7f4e2098f86e75c4d155951bccc8969f5e 100644
--- a/examples/csmsc/voc3/README.md
+++ b/examples/csmsc/voc3/README.md
@@ -152,11 +152,17 @@ TODO:
The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set).
## Pretrained Models
-The pretrained model can be downloaded here [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
+- [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)
-The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip).
+The finetuned model can be downloaded here:
+- [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)
-The static model can be downloaded here [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
+The static model can be downloaded here:
+- [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
+
+The ONNX model can be downloaded here:
+- [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
diff --git a/examples/csmsc/voc4/README.md b/examples/csmsc/voc4/README.md
index b5c6873917602350d4e244fdcfd496d74e6192da..b7add3e574c63b61c22e75ca15289f2b6bc7ce51 100644
--- a/examples/csmsc/voc4/README.md
+++ b/examples/csmsc/voc4/README.md
@@ -112,7 +112,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
-The pretrained model can be downloaded here [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
+- [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)
The static model of Style MelGAN is not available now.
diff --git a/examples/csmsc/voc5/README.md b/examples/csmsc/voc5/README.md
index 21afe6eefad51678568b0b7ab961ceffe33ac779..33e676165a3a2c65c1510d93a55f642bbec92116 100644
--- a/examples/csmsc/voc5/README.md
+++ b/examples/csmsc/voc5/README.md
@@ -112,9 +112,14 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
-The pretrained model can be downloaded here [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
+- [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)
-The static model can be downloaded here [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip).
+The static model can be downloaded here:
+- [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)
+
+The ONNX model can be downloaded here:
+- [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
diff --git a/examples/csmsc/voc6/README.md b/examples/csmsc/voc6/README.md
index 7763b3551422d602f8b7321bb8a0a80b4a0f74e9..26d4523d91c9f9c7036e0ae4ae2053e0dc9402ef 100644
--- a/examples/csmsc/voc6/README.md
+++ b/examples/csmsc/voc6/README.md
@@ -109,9 +109,11 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
-The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)
-The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip).
+The static model can be downloaded here:
+- [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)
Model | Step | eval/loss
:-------------:|:------------:| :------------:
diff --git a/examples/iwslt2012/punc0/README.md b/examples/iwslt2012/punc0/README.md
index 74d599a21d9cc392abdbae60fcda81a65ff2e01d..6caa9710b914b50814c027d5af1e1803bc6de113 100644
--- a/examples/iwslt2012/punc0/README.md
+++ b/examples/iwslt2012/punc0/README.md
@@ -21,7 +21,7 @@
The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).
### Test Result
-- Ernie Linear
+- Ernie
| |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391|
diff --git a/examples/iwslt2012/punc0/RESULTS.md b/examples/iwslt2012/punc0/RESULTS.md
new file mode 100644
index 0000000000000000000000000000000000000000..2e22713d858a9c0f89478857a756a1d8877ff8bd
--- /dev/null
+++ b/examples/iwslt2012/punc0/RESULTS.md
@@ -0,0 +1,9 @@
+# iwslt2012
+
+## Ernie
+
+| |COMMA | PERIOD | QUESTION | OVERALL|
+|:-----:|:-----:|:-----:|:-----:|:-----:|
+|Precision |0.510955 |0.526462 |0.820755 |0.619391|
+|Recall |0.517433 |0.564179 |0.861386 |0.647666|
+|F1 |0.514173 |0.544669 |0.840580 |0.633141|
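
For quick sanity checking, the F1 column is the harmonic mean of the precision and recall columns; a standalone sketch verifying the COMMA row (not part of the patch):

```python
# F1 = 2PR / (P + R); checking the COMMA row of the table above
p, r = 0.510955, 0.517433
print(round(2 * p * r / (p + r), 5))  # 0.51417, matching the F1 column to five decimals
```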
diff --git a/examples/ljspeech/tts1/README.md b/examples/ljspeech/tts1/README.md
index 4f7680e8456d077a6641d3de6f5d54a7c3dd37cf..7f32522acd3f486a958d4e8640ee88275e7fbb8b 100644
--- a/examples/ljspeech/tts1/README.md
+++ b/examples/ljspeech/tts1/README.md
@@ -171,7 +171,8 @@ optional arguments:
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-Pretrained Model can be downloaded here. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
+Pretrained Model can be downloaded here:
+- [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
TransformerTTS checkpoint contains files listed below.
```text
diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md
index f5e919c0fe45bff2fe66a599e8bf1c4030a0c91d..e028fa05d5a1748fab1a4fc3231f6da741701e76 100644
--- a/examples/ljspeech/tts3/README.md
+++ b/examples/ljspeech/tts3/README.md
@@ -214,7 +214,8 @@ optional arguments:
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
+Pretrained FastSpeech2 model with no silence in the edge of audios:
+- [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
diff --git a/examples/ljspeech/voc0/README.md b/examples/ljspeech/voc0/README.md
index 13a50efb54049f6a78c24cf90fc66f00ad2ef0df..41b08d57f42b4d81327a6d74e5b9abd394c19603 100644
--- a/examples/ljspeech/voc0/README.md
+++ b/examples/ljspeech/voc0/README.md
@@ -50,4 +50,5 @@ Synthesize waveform.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-Pretrained Model with residual channel equals 128 can be downloaded here. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip).
+Pretrained Model with residual channel equals 128 can be downloaded here:
+- [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)
diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md
index 6fcb2a520d8167daaea24d7574bd3fa809a6090f..4513b2a05a67342a9be8d923fc517a7738eccf83 100644
--- a/examples/ljspeech/voc1/README.md
+++ b/examples/ljspeech/voc1/README.md
@@ -127,7 +127,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
+Pretrained models can be downloaded here:
+- [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
Parallel WaveGAN checkpoint contains files listed below.
diff --git a/examples/ljspeech/voc5/README.md b/examples/ljspeech/voc5/README.md
index 9fbb9f74615bd9eef4e54f93542b3a836e9fb000..9b31e2650459c54ee1d6bed286f4f361077331ff 100644
--- a/examples/ljspeech/voc5/README.md
+++ b/examples/ljspeech/voc5/README.md
@@ -127,7 +127,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-The pretrained model can be downloaded here [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
@@ -143,6 +144,5 @@ hifigan_ljspeech_ckpt_0.2.0
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
-
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md
index 157949d1fd24c3d3163ec133819dc3a32a09f7dd..f373ca6a387e53b8395570705f6e8576293055c0 100644
--- a/examples/vctk/tts3/README.md
+++ b/examples/vctk/tts3/README.md
@@ -217,7 +217,8 @@ optional arguments:
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
+Pretrained FastSpeech2 model with no silence in the edge of audios:
+- [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
FastSpeech2 checkpoint contains files listed below.
```text
diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md
index 4714f28dc2c7e7aa06857349a24ec6ac2d0423f6..1c3016f885de2bb15bd2af5d4866782cb0a81f80 100644
--- a/examples/vctk/voc1/README.md
+++ b/examples/vctk/voc1/README.md
@@ -132,7 +132,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-Pretrained models can be downloaded here [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip).
+Pretrained models can be downloaded here:
+- [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip)
Parallel WaveGAN checkpoint contains files listed below.
diff --git a/examples/vctk/voc5/README.md b/examples/vctk/voc5/README.md
index b4be341c0e52a74bbf475e22d39fd72bfe737e47..4eb25c02d7f97764d10ab2c6b5e871f06b61b148 100644
--- a/examples/vctk/voc5/README.md
+++ b/examples/vctk/voc5/README.md
@@ -133,7 +133,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
-The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
diff --git a/examples/voxceleb/sv0/RESULT.md b/examples/voxceleb/sv0/RESULT.md
index c37bcecef9b4276adcd7eb05b14893c48c3bdf96..3a3f67d09655cdc62daef128ca4822c6bb05163c 100644
--- a/examples/voxceleb/sv0/RESULT.md
+++ b/examples/voxceleb/sv0/RESULT.md
@@ -4,4 +4,4 @@
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
| --- | --- | --- | --- | --- | --- | --- | ---- |
-| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 |
+| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 | 0.95 |
diff --git a/paddleaudio/paddleaudio/utils/numeric.py b/paddleaudio/paddleaudio/utils/numeric.py
new file mode 100644
index 0000000000000000000000000000000000000000..126cada503f83e9d412d8c83c5728c42cf19c52b
--- /dev/null
+++ b/paddleaudio/paddleaudio/utils/numeric.py
@@ -0,0 +1,30 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import numpy as np
+
+
+def pcm16to32(audio: np.ndarray) -> np.ndarray:
+ """pcm int16 to float32
+
+ Args:
+ audio (np.ndarray): Waveform with dtype of int16.
+
+ Returns:
+ np.ndarray: Waveform with dtype of float32.
+ """
+ if audio.dtype == np.int16:
+ audio = audio.astype("float32")
+ bits = np.iinfo(np.int16).bits
+ audio = audio / (2**(bits - 1))
+ return audio
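
A usage sketch for the new helper; the import path is inferred from the file location in the diff and may differ in the installed package:

```python
import numpy as np

from paddleaudio.utils.numeric import pcm16to32  # path inferred from the diff above

pcm = np.array([0, 16384, -32768], dtype=np.int16)
wav = pcm16to32(pcm)
print(wav.dtype, wav)  # float32 [ 0.   0.5 -1. ]  (scaled by 1 / 2**15)
```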
diff --git a/paddlespeech/cli/asr/infer.py b/paddlespeech/cli/asr/infer.py
index 1fb4be43486fbe896b97d6d6a3ac766c53f208e1..b12b9f6fce89a44564ed66a4346b10032100a4af 100644
--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@@ -80,9 +80,9 @@ pretrained_models = {
},
"deepspeech2online_aishell-zh-16k": {
'url':
- 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
+ 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
'md5':
- 'd5e076217cf60486519f72c217d21b9b',
+ '23e16c69730a1cb5d735c98c83c21e16',
'cfg_path':
'model.yaml',
'ckpt_path':
@@ -426,6 +426,11 @@ class ASRExecutor(BaseExecutor):
try:
audio, audio_sample_rate = soundfile.read(
audio_file, dtype="int16", always_2d=True)
+ audio_duration = audio.shape[0] / audio_sample_rate
+ max_duration = 50.0
+ if audio_duration >= max_duration:
+ logger.error("Please input audio file less then 50 seconds.\n")
+ return
except Exception as e:
logger.exception(e)
logger.error(
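
The guard added above converts the frame count to seconds before rejecting long inputs; a standalone sketch, with `input.wav` as a hypothetical file:

```python
import soundfile

# sketch of the duration guard above; "input.wav" is a hypothetical file
audio, sample_rate = soundfile.read("input.wav", dtype="int16", always_2d=True)
duration = audio.shape[0] / sample_rate  # number of frames / Hz = seconds
if duration >= 50.0:
    print(f"rejected: {duration:.1f}s audio, the ASR CLI expects < 50s")
```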
diff --git a/paddlespeech/cli/vector/infer.py b/paddlespeech/cli/vector/infer.py
index 175a9723e1bd811be97de5995996d79f0ef19307..68e832ac74d4dda805a4185ab09a72f2eb7d6413 100644
--- a/paddlespeech/cli/vector/infer.py
+++ b/paddlespeech/cli/vector/infer.py
@@ -15,6 +15,7 @@ import argparse
import os
import sys
from collections import OrderedDict
+from typing import Dict
from typing import List
from typing import Optional
from typing import Union
@@ -42,9 +43,9 @@ pretrained_models = {
# "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav"
"ecapatdnn_voxceleb12-16k": {
'url':
- 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz',
+ 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz',
'md5':
- 'a1c0dba7d4de997187786ff517d5b4ec',
+ 'cc33023c54ab346cd318408f43fcaf95',
'cfg_path':
'conf/model.yaml', # the yaml config path
'ckpt_path':
@@ -79,7 +80,7 @@ class VectorExecutor(BaseExecutor):
"--task",
type=str,
default="spk",
- choices=["spk"],
+ choices=["spk", "score"],
help="task type in vector domain")
self.parser.add_argument(
"--input",
@@ -147,13 +148,40 @@ class VectorExecutor(BaseExecutor):
logger.info(f"task source: {task_source}")
# stage 3: process the audio one by one
+ # dispatch the action according to the task type
task_result = OrderedDict()
has_exceptions = False
for id_, input_ in task_source.items():
try:
- res = self(input_, model, sample_rate, config, ckpt_path,
- device)
- task_result[id_] = res
+ # extract the speaker audio embedding
+ if parser_args.task == "spk":
+ logger.info("do vector spk task")
+ res = self(input_, model, sample_rate, config, ckpt_path,
+ device)
+ task_result[id_] = res
+ elif parser_args.task == "score":
+ logger.info("do vector score task")
+ logger.info(f"input content {input_}")
+ if len(input_.split()) != 2:
+ logger.error(
+ f"vector score task input {input_} wav num is not two,"
+ "that is {len(input_.split())}")
+ sys.exit(-1)
+
+ # get the enroll and test embedding
+ enroll_audio, test_audio = input_.split()
+ logger.info(
+ f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}"
+ )
+ enroll_embedding = self(enroll_audio, model, sample_rate,
+ config, ckpt_path, device)
+ test_embedding = self(test_audio, model, sample_rate,
+ config, ckpt_path, device)
+
+ # get the score
+ res = self.get_embeddings_score(enroll_embedding,
+ test_embedding)
+ task_result[id_] = res
except Exception as e:
has_exceptions = True
task_result[id_] = f'{e.__class__.__name__}: {e}'
@@ -172,6 +200,49 @@ class VectorExecutor(BaseExecutor):
else:
return True
+ def _get_job_contents(
+ self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
+ """
+ Read a job input file and return its contents in a dictionary.
+ Refactored from Executor._get_job_contents.
+
+ Args:
+ job_input (os.PathLike): The job input file.
+
+ Returns:
+ Dict[str, Union[str, os.PathLike]]: Contents of the job input.
+ """
+ job_contents = OrderedDict()
+ with open(job_input) as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+ k = line.split(' ')[0]
+ v = ' '.join(line.split(' ')[1:])
+ job_contents[k] = v
+ return job_contents
+
+ def get_embeddings_score(self, enroll_embedding, test_embedding):
+ """get the enroll embedding and test embedding score
+
+ Args:
+ enroll_embedding (numpy.array): shape: (emb_size), enroll audio embedding
+ test_embedding (numpy.array): shape: (emb_size), test audio embedding
+
+ Returns:
+ score: the score between enroll embedding and test embedding
+ """
+ if not hasattr(self, "score_func"):
+ self.score_func = paddle.nn.CosineSimilarity(axis=0)
+ logger.info("create the cosine score function ")
+
+ score = self.score_func(
+ paddle.to_tensor(enroll_embedding),
+ paddle.to_tensor(test_embedding))
+
+ return score.item()
+
@stats_wrapper
def __call__(self,
audio_file: os.PathLike,
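
The new `score` task compares two embeddings with cosine similarity via `paddle.nn.CosineSimilarity(axis=0)`; a NumPy sketch of the same quantity:

```python
import numpy as np

def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
    # same quantity as paddle.nn.CosineSimilarity(axis=0) on 1-d embeddings
    return float(np.dot(enroll, test) /
                 (np.linalg.norm(enroll) * np.linalg.norm(test)))

e = np.array([1.0, 0.0, 1.0])
t = np.array([1.0, 1.0, 0.0])
print(cosine_score(e, t))  # 0.5 -> higher means more likely the same speaker
```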
diff --git a/paddlespeech/server/engine/asr/online/asr_engine.py b/paddlespeech/server/engine/asr/online/asr_engine.py
index 389175a0a0c257903f8b4c296842923bf1b73cf7..9029aa6e9e45a24f06c6a806bff0c82dd1e84d95 100644
--- a/paddlespeech/server/engine/asr/online/asr_engine.py
+++ b/paddlespeech/server/engine/asr/online/asr_engine.py
@@ -36,7 +36,7 @@ pretrained_models = {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
'md5':
- 'd5e076217cf60486519f72c217d21b9b',
+ '23e16c69730a1cb5d735c98c83c21e16',
'cfg_path':
'model.yaml',
'ckpt_path':
diff --git a/paddlespeech/t2s/exps/fastspeech2/preprocess.py b/paddlespeech/t2s/exps/fastspeech2/preprocess.py
index 5bda75451b071321e681adceb598f29162b5fb8c..db1842b2e89fe3044e96ca4babb07c1796d06da3 100644
--- a/paddlespeech/t2s/exps/fastspeech2/preprocess.py
+++ b/paddlespeech/t2s/exps/fastspeech2/preprocess.py
@@ -86,6 +86,9 @@ def process_sentence(config: Dict[str, Any],
logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel)
+ # utt_id may be popped in compare_duration_and_mel_length
+ if utt_id not in sentences:
+ return None
phones = sentences[utt_id][0]
durations = sentences[utt_id][1]
num_frames = logmel.shape[0]
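
This guard is needed because `compare_duration_and_mel_length` can drop an utterance in place; a toy illustration with a stand-in helper (not the real implementation):

```python
# toy illustration; compare_duration_and_mel_length here is a stand-in
sentences = {"utt1": (["a", "b"], [3, 4])}

def compare_duration_and_mel_length(sentences, utt_id, num_frames):
    # the real helper pops utterances whose durations disagree with the mel length
    if sum(sentences[utt_id][1]) != num_frames:
        sentences.pop(utt_id)

compare_duration_and_mel_length(sentences, "utt1", 10)
if "utt1" not in sentences:  # without this guard, indexing would raise KeyError
    print("utt1 was dropped, skip it")
```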
diff --git a/paddlespeech/t2s/exps/inference.py b/paddlespeech/t2s/exps/inference.py
index 1188ddfb132151e095ba0541ecb0fce12ad7e4ab..62602a01f28c4365c80ba6fb01b98cef2572a579 100644
--- a/paddlespeech/t2s/exps/inference.py
+++ b/paddlespeech/t2s/exps/inference.py
@@ -104,7 +104,7 @@ def get_voc_output(args, voc_predictor, input):
def parse_args():
parser = argparse.ArgumentParser(
- description="Paddle Infernce with speedyspeech & parallel wavegan.")
+ description="Paddle Infernce with acoustic model & vocoder.")
# acoustic model
parser.add_argument(
'--am',
diff --git a/paddlespeech/t2s/exps/ort_predict.py b/paddlespeech/t2s/exps/ort_predict.py
new file mode 100644
index 0000000000000000000000000000000000000000..e8d4d61c32e09983e66346a2b9f6a26a7c269846
--- /dev/null
+++ b/paddlespeech/t2s/exps/ort_predict.py
@@ -0,0 +1,156 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from pathlib import Path
+
+import jsonlines
+import numpy as np
+import onnxruntime as ort
+import soundfile as sf
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import get_test_dataset
+from paddlespeech.t2s.utils import str2bool
+
+
+def get_sess(args, field='am'):
+ full_name = ''
+ if field == 'am':
+ full_name = args.am
+ elif field == 'voc':
+ full_name = args.voc
+ model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
+ sess_options = ort.SessionOptions()
+ sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
+ sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+
+ if args.device == "gpu":
+ # fastspeech2/mb_melgan can't use trt now!
+ if args.use_trt:
+ providers = ['TensorrtExecutionProvider']
+ else:
+ providers = ['CUDAExecutionProvider']
+ elif args.device == "cpu":
+ providers = ['CPUExecutionProvider']
+ sess_options.intra_op_num_threads = args.cpu_threads
+ sess = ort.InferenceSession(
+ model_dir, providers=providers, sess_options=sess_options)
+ return sess
+
+
+def ort_predict(args):
+ # construct dataset for evaluation
+ with jsonlines.open(args.test_metadata, 'r') as reader:
+ test_metadata = list(reader)
+ am_name = args.am[:args.am.rindex('_')]
+ am_dataset = args.am[args.am.rindex('_') + 1:]
+ test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset)
+
+ output_dir = Path(args.output_dir)
+ output_dir.mkdir(parents=True, exist_ok=True)
+
+ fs = 24000 if am_dataset != 'ljspeech' else 22050
+
+ # am
+ am_sess = get_sess(args, field='am')
+
+ # vocoder
+ voc_sess = get_sess(args, field='voc')
+
+ # am warmup
+ for T in [27, 38, 54]:
+ data = np.random.randint(1, 266, size=(T, ))
+ am_sess.run(None, {"text": data})
+
+ # voc warmup
+ for T in [227, 308, 544]:
+ data = np.random.rand(T, 80).astype("float32")
+ voc_sess.run(None, {"logmel": data})
+ print("warm up done!")
+
+ N = 0
+ T = 0
+ for example in test_dataset:
+ utt_id = example['utt_id']
+ phone_ids = example["text"]
+ with timer() as t:
+ mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
+ mel = mel[0]
+ wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
+
+ N += len(wav[0])
+ T += t.elapse
+ speed = len(wav[0]) / t.elapse
+ rtf = fs / speed
+ sf.write(
+ str(output_dir / (utt_id + ".wav")),
+ np.array(wav)[0],
+ samplerate=fs)
+ print(
+ f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+ )
+ print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="Infernce with onnxruntime.")
+ # acoustic model
+ parser.add_argument(
+ '--am',
+ type=str,
+ default='fastspeech2_csmsc',
+ choices=[
+ 'fastspeech2_csmsc',
+ ],
+ help='Choose acoustic model type of tts task.')
+
+ # voc
+ parser.add_argument(
+ '--voc',
+ type=str,
+ default='hifigan_csmsc',
+ choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
+ help='Choose vocoder type of tts task.')
+ # other
+ parser.add_argument(
+ "--inference_dir", type=str, help="dir to save inference models")
+ parser.add_argument("--test_metadata", type=str, help="test metadata.")
+ parser.add_argument("--output_dir", type=str, help="output dir")
+
+ # inference
+ parser.add_argument(
+ "--use_trt",
+ type=str2bool,
+ default=False,
+ help="Whether to use inference engin TensorRT.", )
+
+ parser.add_argument(
+ "--device",
+ default="gpu",
+ choices=["gpu", "cpu"],
+ help="Device selected for inference.", )
+ parser.add_argument('--cpu_threads', type=int, default=1)
+
+ args, _ = parser.parse_known_args()
+ return args
+
+
+def main():
+ args = parse_args()
+
+ ort_predict(args)
+
+
+if __name__ == "__main__":
+ main()
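
The `speed`/`rtf` arithmetic in the loop reduces to wall-clock time divided by the duration of the generated audio; a worked example with made-up numbers:

```python
# made-up numbers: 57,600 samples (2.4 s at 24 kHz) generated in 0.3 s
fs, n_samples, elapsed = 24000, 57600, 0.3
speed = n_samples / elapsed      # samples generated per wall-clock second
rtf = fs / speed                 # == elapsed / (n_samples / fs)
print(f"speed: {speed:.0f} Hz, RTF: {rtf:.3f}")  # RTF 0.125, faster than real time
```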
diff --git a/paddlespeech/t2s/exps/ort_predict_e2e.py b/paddlespeech/t2s/exps/ort_predict_e2e.py
new file mode 100644
index 0000000000000000000000000000000000000000..8aa04cbc556ad36d7275c2b5ccad3cd9fa5b139b
--- /dev/null
+++ b/paddlespeech/t2s/exps/ort_predict_e2e.py
@@ -0,0 +1,183 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from pathlib import Path
+
+import numpy as np
+import onnxruntime as ort
+import soundfile as sf
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import get_sentences
+from paddlespeech.t2s.utils import str2bool
+
+
+def get_sess(args, field='am'):
+ full_name = ''
+ if field == 'am':
+ full_name = args.am
+ elif field == 'voc':
+ full_name = args.voc
+ model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
+ sess_options = ort.SessionOptions()
+ sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
+ sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+
+ if args.device == "gpu":
+ # fastspeech2/mb_melgan can't use trt now!
+ if args.use_trt:
+ providers = ['TensorrtExecutionProvider']
+ else:
+ providers = ['CUDAExecutionProvider']
+ elif args.device == "cpu":
+ providers = ['CPUExecutionProvider']
+ sess_options.intra_op_num_threads = args.cpu_threads
+ sess = ort.InferenceSession(
+ model_dir, providers=providers, sess_options=sess_options)
+ return sess
+
+
+def ort_predict(args):
+
+ # frontend
+ frontend = get_frontend(args)
+
+ output_dir = Path(args.output_dir)
+ output_dir.mkdir(parents=True, exist_ok=True)
+ sentences = get_sentences(args)
+
+ am_name = args.am[:args.am.rindex('_')]
+ am_dataset = args.am[args.am.rindex('_') + 1:]
+ fs = 24000 if am_dataset != 'ljspeech' else 22050
+
+ # am
+ am_sess = get_sess(args, field='am')
+
+ # vocoder
+ voc_sess = get_sess(args, field='voc')
+
+ # am warmup
+ for T in [27, 38, 54]:
+ data = np.random.randint(1, 266, size=(T, ))
+ am_sess.run(None, {"text": data})
+
+ # voc warmup
+ for T in [227, 308, 544]:
+ data = np.random.rand(T, 80).astype("float32")
+ voc_sess.run(None, {"logmel": data})
+ print("warm up done!")
+
+ # frontend warmup
+ # loading the model costs 0.5+ seconds
+ if args.lang == 'zh':
+ frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True)
+ else:
+ print("lang should in be 'zh' here!")
+
+ N = 0
+ T = 0
+ merge_sentences = True
+ for utt_id, sentence in sentences:
+ with timer() as t:
+ if args.lang == 'zh':
+ input_ids = frontend.get_input_ids(
+ sentence, merge_sentences=merge_sentences)
+
+ phone_ids = input_ids["phone_ids"]
+ else:
+ print("lang should in be 'zh' here!")
+ # merge_sentences=True here, so we only use the first item of phone_ids
+ phone_ids = phone_ids[0].numpy()
+ mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
+ mel = mel[0]
+ wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
+
+ N += len(wav[0])
+ T += t.elapse
+ speed = len(wav[0]) / t.elapse
+ rtf = fs / speed
+ sf.write(
+ str(output_dir / (utt_id + ".wav")),
+ np.array(wav)[0],
+ samplerate=fs)
+ print(
+ f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+ )
+ print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
+
+
+def parse_args():
+ parser = argparse.ArgumentParser(description="Infernce with onnxruntime.")
+ # acoustic model
+ parser.add_argument(
+ '--am',
+ type=str,
+ default='fastspeech2_csmsc',
+ choices=[
+ 'fastspeech2_csmsc',
+ ],
+ help='Choose acoustic model type of tts task.')
+ parser.add_argument(
+ "--phones_dict", type=str, default=None, help="phone vocabulary file.")
+ parser.add_argument(
+ "--tones_dict", type=str, default=None, help="tone vocabulary file.")
+
+ # voc
+ parser.add_argument(
+ '--voc',
+ type=str,
+ default='hifigan_csmsc',
+ choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
+ help='Choose vocoder type of tts task.')
+ # other
+ parser.add_argument(
+ "--inference_dir", type=str, help="dir to save inference models")
+ parser.add_argument(
+ "--text",
+ type=str,
+ help="text to synthesize, a 'utt_id sentence' pair per line")
+ parser.add_argument("--output_dir", type=str, help="output dir")
+ parser.add_argument(
+ '--lang',
+ type=str,
+ default='zh',
+ help='Choose model language. zh or en')
+
+ # inference
+ parser.add_argument(
+ "--use_trt",
+ type=str2bool,
+ default=False,
+ help="Whether to use inference engin TensorRT.", )
+
+ parser.add_argument(
+ "--device",
+ default="gpu",
+ choices=["gpu", "cpu"],
+ help="Device selected for inference.", )
+ parser.add_argument('--cpu_threads', type=int, default=1)
+
+ args, _ = parser.parse_known_args()
+ return args
+
+
+def main():
+ args = parse_args()
+
+ ort_predict(args)
+
+
+if __name__ == "__main__":
+ main()
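
`--text` is read by `get_sentences`, which expects one `utt_id sentence` pair per line; a minimal equivalent parser (a sketch, the real one lives in `paddlespeech.t2s.exps.syn_utils`):

```python
def read_sentences(path):
    # each line: "<utt_id> <sentence>", e.g. "001 你好,欢迎使用飞桨框架!"
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            utt_id, sentence = line.split(maxsplit=1)
            yield utt_id, sentence
```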
diff --git a/paddlespeech/t2s/exps/speedyspeech/preprocess.py b/paddlespeech/t2s/exps/speedyspeech/preprocess.py
index 3f81c4e14753d19e51db5d23f5e75440c67de34b..e833d13940530f293842a842b65f33cf6d03d9bd 100644
--- a/paddlespeech/t2s/exps/speedyspeech/preprocess.py
+++ b/paddlespeech/t2s/exps/speedyspeech/preprocess.py
@@ -79,6 +79,9 @@ def process_sentence(config: Dict[str, Any],
logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel)
+ # utt_id may be popped in compare_duration_and_mel_length
+ if utt_id not in sentences:
+ return None
labels = sentences[utt_id][0]
# extract phone and duration
phones = []
diff --git a/paddlespeech/t2s/exps/synthesize_streaming.py b/paddlespeech/t2s/exps/synthesize_streaming.py
index f38b2d3522f0c047ef6b3351f2db71130d2cebc4..7b9906c1076edda751c4a10772df0f82e8f1f39b 100644
--- a/paddlespeech/t2s/exps/synthesize_streaming.py
+++ b/paddlespeech/t2s/exps/synthesize_streaming.py
@@ -90,6 +90,7 @@ def evaluate(args):
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
merge_sentences = True
+ get_tone_ids = False
N = 0
T = 0
@@ -98,8 +99,6 @@ def evaluate(args):
for utt_id, sentence in sentences:
with timer() as t:
- get_tone_ids = False
-
if args.lang == 'zh':
input_ids = frontend.get_input_ids(
sentence,
diff --git a/paddlespeech/t2s/exps/tacotron2/preprocess.py b/paddlespeech/t2s/exps/tacotron2/preprocess.py
index 7f41089ebf9d71b336d082b065e8b50c541f7edd..14a0d7eae227f5650a716bc656f3d0c32ee077e3 100644
--- a/paddlespeech/t2s/exps/tacotron2/preprocess.py
+++ b/paddlespeech/t2s/exps/tacotron2/preprocess.py
@@ -82,6 +82,9 @@ def process_sentence(config: Dict[str, Any],
logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel)
+ # utt_id may be popped in compare_duration_and_mel_length
+ if utt_id not in sentences:
+ return None
phones = sentences[utt_id][0]
durations = sentences[utt_id][1]
num_frames = logmel.shape[0]
diff --git a/paddlespeech/t2s/modules/positional_encoding.py b/paddlespeech/t2s/modules/positional_encoding.py
index 7c368c3aa8a7520557b18d3bf0c09febd9151ddf..715c576f52ba586992b77e66b398f1a56e8a0fc7 100644
--- a/paddlespeech/t2s/modules/positional_encoding.py
+++ b/paddlespeech/t2s/modules/positional_encoding.py
@@ -31,8 +31,9 @@ def sinusoid_position_encoding(num_positions: int,
channel = paddle.arange(0, feature_size, 2, dtype=dtype)
index = paddle.arange(start_pos, start_pos + num_positions, 1, dtype=dtype)
- p = (paddle.unsqueeze(index, -1) *
- omega) / (10000.0**(channel / float(feature_size)))
+ exponent = channel / float(feature_size)
+ denominator = paddle.to_tensor([10000.0], dtype='float32')**exponent
+ p = (paddle.unsqueeze(index, -1) * omega) / denominator
encodings = paddle.zeros([num_positions, feature_size], dtype=dtype)
encodings[:, 0::2] = paddle.sin(p)
encodings[:, 1::2] = paddle.cos(p)
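
The rewritten lines compute the standard sinusoid position encoding, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); a NumPy mirror of the computation (a sketch that omits the `omega` scale factor defined elsewhere in the module and assumes an even `feature_size`):

```python
import numpy as np

def sinusoid_position_encoding_np(num_positions, feature_size, start_pos=0):
    # NumPy mirror of the Paddle code above (a sketch, not the library API)
    channel = np.arange(0, feature_size, 2, dtype="float32")
    index = np.arange(start_pos, start_pos + num_positions, dtype="float32")
    p = index[:, None] / (10000.0 ** (channel / float(feature_size)))
    encodings = np.zeros((num_positions, feature_size), dtype="float32")
    encodings[:, 0::2] = np.sin(p)
    encodings[:, 1::2] = np.cos(p)
    return encodings
```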
diff --git a/paddlespeech/vector/models/ecapa_tdnn.py b/paddlespeech/vector/models/ecapa_tdnn.py
index 0e7287cd3614d8964941f6d14179e0ce7f3c4d71..895ff13f4509c7070d2473aebf8ce693a50dbcee 100644
--- a/paddlespeech/vector/models/ecapa_tdnn.py
+++ b/paddlespeech/vector/models/ecapa_tdnn.py
@@ -79,6 +79,20 @@ class Conv1d(nn.Layer):
bias_attr=bias, )
def forward(self, x):
+ """Do conv1d forward
+
+ Args:
+ x (paddle.Tensor): [N, C, L] input data,
+ N is the batch,
+ C is the data dimension,
+ L is the time
+
+ Raises:
+ ValueError: only support the same padding type
+
+ Returns:
+ paddle.Tensor: the value of conv1d
+ """
if self.padding == "same":
x = self._manage_padding(x, self.kernel_size, self.dilation,
self.stride)
@@ -88,6 +102,20 @@ class Conv1d(nn.Layer):
return self.conv(x)
def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
+ """Padding the input data
+
+ Args:
+ x (paddle.Tensor): [N, C, L] input data
+ N is the batch,
+ C is the data dimension,
+ L is the time
+ kernel_size (int): 1-d convolution kernel size
+ dilation (int): 1-d convolution dilation
+ stride (int): 1-d convolution stride
+
+ Returns:
+ paddle.Tensor: the padded input data
+ """
L_in = x.shape[-1] # Detecting input shape
padding = self._get_padding_elem(L_in, stride, kernel_size,
dilation) # Time padding
@@ -101,6 +129,17 @@ class Conv1d(nn.Layer):
stride: int,
kernel_size: int,
dilation: int):
+ """Calculate the padding value in same mode
+
+ Args:
+ L_in (int): the times of the input data,
+ stride (int): 1-d convolution stride
+ kernel_size (int): 1-d convolution kernel size
+ dilation (int): 1-d convolution stride
+
+ Returns:
+ int: return the padding value in same mode
+ """
if stride > 1:
n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
L_out = stride * (n_steps - 1) + kernel_size * dilation
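
'Same' padding keeps the output length at ceil(L_in / stride); one common way to compute the left/right padding, shown as a standalone sketch (not the exact code of `_get_padding_elem`):

```python
import math

def same_padding_1d(L_in, kernel_size, dilation=1, stride=1):
    # choose total padding so that L_out == ceil(L_in / stride)
    effective_kernel = dilation * (kernel_size - 1) + 1
    L_out = math.ceil(L_in / stride)
    total = max(0, (L_out - 1) * stride + effective_kernel - L_in)
    return total // 2, total - total // 2  # (left, right)

print(same_padding_1d(100, kernel_size=5, dilation=2))  # (4, 4)
```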
@@ -245,6 +284,13 @@ class SEBlock(nn.Layer):
class AttentiveStatisticsPooling(nn.Layer):
def __init__(self, channels, attention_channels=128, global_context=True):
+ """Compute the speaker verification statistics
+ The detail info is section 3.1 in https://arxiv.org/pdf/1709.01507.pdf
+ Args:
+ channels (int): input data channel or data dimension
+ attention_channels (int, optional): attention dimension. Defaults to 128.
+ global_context (bool, optional): If use the global context information. Defaults to True.
+ """
super().__init__()
self.eps = 1e-12