diff --git a/README.md b/README.md index a90498293e6dd2b01aa9649105f1f2b075a3bf36..5093dbd678a895026a5021c655f944adc202054f 100644 --- a/README.md +++ b/README.md @@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server) + + ## Model List PaddleSpeech supports a series of most popular models. They are summarized in [released models](./docs/source/released_model.md) and attached with available pretrained models. + + **Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details: @@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+ + **Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follow: @@ -457,10 +463,10 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r - + @@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
GE2E + Tactron2GE2E + Tacotron2 AISHELL-3 - ge2e-tactron2-aishell3 + ge2e-tacotron2-aishell3
+ + **Audio Classification** @@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+ + **Speaker Verification** @@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+ + **Punctuation Restoration** @@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht - [Advanced Usage](./docs/source/tts/advanced_usage.md) - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md) - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) + - Speaker Verification + - [Audio Searching](./demos/audio_searching/README.md) + - [Speaker Verification](./demos/speaker_verification/README.md) - [Audio Classification](./demos/audio_tagging/README.md) - - [Speaker Verification](./demos/speaker_verification/README.md) - [Speech Translation](./demos/speech_translation/README.md) + - [Speech Server](./demos/speech_server/README.md) - [Released Models](./docs/source/released_model.md) + - [Speech-to-Text](#SpeechToText) + - [Text-to-Speech](#TextToSpeech) + - [Audio Classification](#AudioClassification) + - [Speaker Verification](#SpeakerVerification) + - [Punctuation Restoration](#PunctuationRestoration) - [Community](#Community) - [Welcome to contribute](#contribution) - [License](#License) diff --git a/README_cn.md b/README_cn.md index ab4ce6e6b878626011ac5cbcfb5c82b4b03ef5d6..5dab7fa0c034fe778c7f7c10e65a78fb6c3e52b5 100644 --- a/README_cn.md +++ b/README_cn.md @@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav ## 模型列表 PaddleSpeech 支持很多主流的模型,并提供了预训练模型,详情请见[模型列表](./docs/source/released_model.md)。 + + PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识别语言模型和语音翻译, 详情如下:
@@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
+ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声学模型和声码器。声学模型和声码器模型如下: @@ -447,10 +450,10 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - + @@ -488,6 +491,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
GE2E + Tactron2GE2E + Tacotron2 AISHELL-3 - ge2e-tactron2-aishell3 + ge2e-tacotron2-aishell3
+ + **声纹识别** @@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
+ + **标点恢复** @@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - [进阶用法](./docs/source/tts/advanced_usage.md) - [中文文本前端](./docs/source/tts/zh_text_frontend.md) - [测试语音样本](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) + - 声纹识别 + - [声纹识别](./demos/speaker_verification/README_cn.md) + - [音频检索](./demos/audio_searching/README_cn.md) - [声音分类](./demos/audio_tagging/README_cn.md) - - [声纹识别](./demos/speaker_verification/README_cn.md) - [语音翻译](./demos/speech_translation/README_cn.md) + - [服务化部署](./demos/speech_server/README_cn.md) - [模型列表](#模型列表) - [语音识别](#语音识别模型) - [语音合成](#语音合成模型) - [声音分类](#声音分类模型) + - [声纹识别](#声纹识别模型) + - [标点恢复](#标点恢复模型) - [技术交流群](#技术交流群) - [欢迎贡献](#欢迎贡献) - [License](#License) diff --git a/demos/speaker_verification/README.md b/demos/speaker_verification/README.md index 8739d402da97a576e5c1349fd01913e3c399911e..7d7180ae9df6ef2c34bd414bfe65ecfc7284fc60 100644 --- a/demos/speaker_verification/README.md +++ b/demos/speaker_verification/README.md @@ -30,6 +30,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav paddlespeech vector --task spk --input vec.job echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk + + paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav" + + echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job + paddlespeech vector --task score --input vec.job ``` Usage: ``` Arguments: - `input`(required): Audio file to recognize. + - `task` (required): Task of `vector`, either `spk` (extract a speaker embedding) or `score` (score two utterances against each other). Default: `spk`. - `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`. - `sample_rate`: Sample rate of the model. Default: `16000`. - `config`: Config of vector task. Use pretrained model when it is None. Default: `None`. 
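+
+ Note: with `--task score`, the two audio files in `input` are treated as an enrollment utterance and a test utterance, and the command prints a similarity score in `[0, 1]` between their speaker embeddings. As an illustration only (not necessarily the exact formula PaddleSpeech uses internally), such a score can be obtained by rescaling cosine similarity:
+
+ ```python
+ import numpy as np
+
+ def cosine_score(enroll: np.ndarray, test: np.ndarray) -> float:
+     # Cosine similarity lies in [-1, 1]; map it linearly into [0, 1].
+     cos = np.dot(enroll, test) / (np.linalg.norm(enroll) * np.linalg.norm(test))
+     return float((cos + 1.0) / 2.0)
+ ```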
@@ -47,45 +53,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav Output: ```bash - demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 - -3.04878 1.611095 10.127234 -10.534177 -15.821609 - 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 - -11.343508 2.3385992 -8.719341 14.213509 15.404744 - -0.39327756 6.338786 2.688887 8.7104025 17.469526 - -8.77959 7.0576906 4.648855 -1.3089896 -23.294737 - 8.013747 13.891729 -9.926753 5.655307 -5.9422326 - -22.842539 0.6293588 -18.46266 -10.811862 9.8192625 - 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 - 1.7594414 -0.6485091 4.485623 2.0207152 7.264915 - -6.40137 23.63524 2.9711294 -22.708025 9.93719 - 20.354511 -10.324688 -0.700492 -8.783211 -5.27593 - 15.999649 3.3004563 12.747926 15.429879 4.7849145 - 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 - 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 - -9.224193 14.568347 -10.568833 4.982321 -4.342062 - 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 - -6.680575 0.4757669 -5.035051 -6.7964664 16.865469 - -11.54324 7.681869 0.44475392 9.708182 -8.932846 - 0.4123232 -4.361452 1.3948607 9.511665 0.11667654 - 2.9079323 6.049952 9.275183 -18.078873 6.2983274 - -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 - 4.010979 11.000591 -2.8873312 7.1352735 -16.79663 - 18.495346 -14.293832 7.89578 2.2714825 22.976387 - -4.875734 -3.0836344 -2.9999814 13.751918 6.448228 - -11.924197 2.171869 2.0423572 -6.173772 10.778437 - 25.77281 -4.9495463 14.57806 0.3044315 2.6132357 - -7.591999 -2.076944 9.025118 1.7834753 -3.1799617 - -4.9401326 23.465864 5.1685796 -9.018578 9.037825 - -4.4150195 6.859591 -12.274467 -0.88911164 5.186309 - -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 - -12.397416 -12.719869 -1.395601 2.1150916 5.7381287 - -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 - 8.731719 -20.778936 -11.495662 5.8033476 -4.752041 - 10.833007 -6.717991 4.504732 13.4244375 1.1306485 - 7.3435574 1.400918 14.704036 -9.501399 7.2315617 - -6.417456 1.3333273 11.872697 -0.30664724 8.8845 - 6.5569253 4.7948146 0.03662816 -8.704245 6.224871 - -3.2701402 -11.508579 ] + demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055 + 1.756596 5.167894 10.80636 -3.8226728 -5.6141334 + 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897 + -9.723131 0.6619743 -6.976803 10.213478 7.494748 + 2.9105635 3.8949256 3.7999806 7.1061673 16.905321 + -7.1493764 8.733103 3.4230042 -4.831653 -11.403367 + 11.232214 7.1274667 -4.2828417 2.452362 -5.130748 + -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683 + 0.7618269 1.1253023 -2.083836 4.725744 -8.782597 + -3.539873 3.814236 5.1420674 2.162061 4.096431 + -6.4162116 12.747448 1.9429878 -15.152943 6.417416 + 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939 + 11.567354 3.69788 11.258265 7.442363 9.183411 + 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783 + 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073 + -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116 + 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578 + -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651 + -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556 + -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704 + 3.272176 2.8382776 5.134597 -9.190781 -0.5657382 + -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576 + -0.31784213 9.493548 2.1144536 4.358092 -12.089823 + 8.451689 -7.925461 4.6242585 4.4289427 18.692003 + -2.6204622 -5.149185 -0.35821092 8.488551 4.981496 + -9.32683 -2.2544234 6.6417594 1.2119585 10.977129 + 16.555033 3.3238444 9.551863 
-1.6676947 -0.79539716 + -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796 + 0.66607 15.443222 4.740594 -3.4725387 11.592567 + -2.054497 1.7361217 -8.265324 -9.30447 5.4068313 + -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733 + -8.649895 -9.998958 -2.564841 -0.53999114 2.601808 + -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085 + 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456 + 7.3629923 0.4657332 3.132599 12.438889 -1.8337058 + 4.532936 2.7264361 10.145339 -6.521951 2.897153 + -3.3925855 5.079156 7.759716 4.677565 5.8457737 + 2.402413 7.7071047 3.9711342 -6.390043 6.1268735 + -3.7760346 -11.118123 ] ``` - Python API ```python import paddle from paddlespeech.cli import VectorExecutor vector_executor = VectorExecutor() audio_emb = vector_executor( model='ecapatdnn_voxceleb12', sample_rate=16000, - config=None, + config=None, # Set `config` and `ckpt_path` to None to use pretrained model. ckpt_path=None, audio_file='./85236145389.wav', - force_yes=False, device=paddle.get_device()) print('Audio embedding Result: \n{}'.format(audio_emb)) + + test_emb = vector_executor( + model='ecapatdnn_voxceleb12', + sample_rate=16000, + config=None, # Set `config` and `ckpt_path` to None to use pretrained model. + ckpt_path=None, + audio_file='./123456789.wav', + device=paddle.get_device()) + print('Test embedding Result: \n{}'.format(test_emb)) + + # score range [0, 1] + score = vector_executor.get_embeddings_score(audio_emb, test_emb) + print(f"Embeddings Score: {score}") ``` - Output: + Output: + ```bash # Vector Result: - [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 - -3.04878 1.611095 10.127234 -10.534177 -15.821609 - 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 - -11.343508 2.3385992 -8.719341 14.213509 15.404744 - -0.39327756 6.338786 2.688887 8.7104025 17.469526 - -8.77959 7.0576906 4.648855 -1.3089896 -23.294737 - 8.013747 13.891729 -9.926753 5.655307 -5.9422326 - -22.842539 0.6293588 -18.46266 -10.811862 9.8192625 - 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 - 1.7594414 -0.6485091 4.485623 2.0207152 7.264915 - -6.40137 23.63524 2.9711294 -22.708025 9.93719 - 20.354511 -10.324688 -0.700492 -8.783211 -5.27593 - 15.999649 3.3004563 12.747926 15.429879 4.7849145 - 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 - 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 - -9.224193 14.568347 -10.568833 4.982321 -4.342062 - 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 - -6.680575 0.4757669 -5.035051 -6.7964664 16.865469 - -11.54324 7.681869 0.44475392 9.708182 -8.932846 - 0.4123232 -4.361452 1.3948607 9.511665 0.11667654 - 2.9079323 6.049952 9.275183 -18.078873 6.2983274 - -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 - 4.010979 11.000591 -2.8873312 7.1352735 -16.79663 - 18.495346 -14.293832 7.89578 2.2714825 22.976387 - -4.875734 -3.0836344 -2.9999814 13.751918 6.448228 - -11.924197 2.171869 2.0423572 -6.173772 10.778437 - 25.77281 -4.9495463 14.57806 0.3044315 2.6132357 - -7.591999 -2.076944 9.025118 1.7834753 -3.1799617 - -4.9401326 23.465864 5.1685796 -9.018578 9.037825 - -4.4150195 6.859591 -12.274467 -0.88911164 5.186309 - -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 - -12.397416 -12.719869 -1.395601 2.1150916 5.7381287 - -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 - 8.731719 -20.778936 -11.495662 5.8033476 -4.752041 - 10.833007 -6.717991 4.504732 13.4244375 1.1306485 - 7.3435574 1.400918 14.704036 -9.501399 7.2315617 - -6.417456 1.3333273 11.872697 -0.30664724 8.8845 - 6.5569253 4.7948146 0.03662816 -8.704245 6.224871 - -3.2701402 
-11.508579 ] + Audio embedding Result: + [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055 + 1.756596 5.167894 10.80636 -3.8226728 -5.6141334 + 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897 + -9.723131 0.6619743 -6.976803 10.213478 7.494748 + 2.9105635 3.8949256 3.7999806 7.1061673 16.905321 + -7.1493764 8.733103 3.4230042 -4.831653 -11.403367 + 11.232214 7.1274667 -4.2828417 2.452362 -5.130748 + -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683 + 0.7618269 1.1253023 -2.083836 4.725744 -8.782597 + -3.539873 3.814236 5.1420674 2.162061 4.096431 + -6.4162116 12.747448 1.9429878 -15.152943 6.417416 + 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939 + 11.567354 3.69788 11.258265 7.442363 9.183411 + 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783 + 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073 + -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116 + 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578 + -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651 + -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556 + -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704 + 3.272176 2.8382776 5.134597 -9.190781 -0.5657382 + -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576 + -0.31784213 9.493548 2.1144536 4.358092 -12.089823 + 8.451689 -7.925461 4.6242585 4.4289427 18.692003 + -2.6204622 -5.149185 -0.35821092 8.488551 4.981496 + -9.32683 -2.2544234 6.6417594 1.2119585 10.977129 + 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716 + -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796 + 0.66607 15.443222 4.740594 -3.4725387 11.592567 + -2.054497 1.7361217 -8.265324 -9.30447 5.4068313 + -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733 + -8.649895 -9.998958 -2.564841 -0.53999114 2.601808 + -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085 + 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456 + 7.3629923 0.4657332 3.132599 12.438889 -1.8337058 + 4.532936 2.7264361 10.145339 -6.521951 2.897153 + -3.3925855 5.079156 7.759716 4.677565 5.8457737 + 2.402413 7.7071047 3.9711342 -6.390043 6.1268735 + -3.7760346 -11.118123 ] + # get the test embedding + Test embedding Result: + [ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125 + 6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845 + -8.04332 4.344489 2.3200977 -14.306299 5.184692 + -11.55602 -3.8497238 0.6444722 1.2833948 2.6766639 + 0.5878921 0.7946299 1.7207596 2.5791872 14.998469 + -1.3385371 15.031221 -0.8006958 1.99287 -9.52007 + 2.435466 4.003221 -4.33817 -4.898601 -5.304714 + -18.033886 10.790787 -12.784645 -5.641755 2.9761686 + -10.566622 1.4839455 6.152458 -5.7195854 2.8603241 + 6.112133 8.489869 5.5958056 1.2836679 -1.2293907 + 0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906 + 14.905906 -5.025907 0.7866458 -4.2444224 -16.354029 + 10.521315 0.9604709 -3.3257897 7.144871 -13.592733 + -8.568869 -1.7953678 0.26313916 10.916714 -6.9374123 + 1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357 + -0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422 + 3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846 + 1.3420814 4.240421 -2.772944 -2.8451524 16.311104 + 4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239 + -1.5708797 1.568961 1.1413603 3.5032008 -0.45251232 + -6.786333 16.89443 5.3366146 -8.789056 0.6355629 + 3.2579517 -3.328322 7.5969577 0.66025066 -6.550468 + -9.148656 2.020372 -0.4615173 1.1965656 -3.8764873 + 11.6562195 -6.0750933 12.182899 3.2218833 0.81969476 + 5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166 + -5.249467 -2.2671914 7.2658715 -13.298164 4.821147 + -2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838 
+ -3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094 + -3.4903061 2.2408795 5.5010734 -3.970756 11.99696 + -7.8858757 0.43160373 -5.5059714 4.3426995 16.322706 + 11.635366 0.72157705 -9.245714 -3.91465 -4.449838 + -1.5716927 7.713747 -2.2430465 -6.198303 -13.481864 + 2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571 + 13.270688 3.448231 -7.0659585 4.5886116 -4.466099 + -0.296428 -11.463529 -2.6076477 14.110243 -6.9725137 + -1.9962958 2.7119343 19.391657 0.01961198 14.607133 + -1.6695905 -4.391516 1.3131028 -6.670972 -5.888604 + 12.0612335 5.9285784 3.3715196 1.492534 10.723728 + -0.95514804 -12.085431 ] + # get the score between enroll and test + Embeddings Score: 0.4292638301849365 ``` ### 4.Pretrained Models diff --git a/demos/speaker_verification/README_cn.md b/demos/speaker_verification/README_cn.md index fe8949b3ca6d9de77e5095d6bc55844133b73f52..db382f298df74c73ef5fcbd5a3fb64fb2fa1c44f 100644 --- a/demos/speaker_verification/README_cn.md +++ b/demos/speaker_verification/README_cn.md @@ -29,6 +29,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav paddlespeech vector --task spk --input vec.job echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk + + paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav" + + echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job + paddlespeech vector --task score --input vec.job ``` 使用方法: ``` 参数: - `input`(必须输入):用于识别的音频文件。 + - `task` (必须输入): 用于指定 `vector` 处理的具体任务,可选 `spk` 或 `score`,默认是 `spk`。 - `model`:声纹任务的模型,默认值:`ecapatdnn_voxceleb12`。 - `sample_rate`:音频采样率,默认值:`16000`。 - `config`:声纹任务的参数文件,若不设置则使用预训练模型中的默认配置,默认值:`None`。 @@ -45,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav 输出: ```bash - demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 - -3.04878 1.611095 10.127234 -10.534177 -15.821609 - 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 - -11.343508 2.3385992 -8.719341 14.213509 15.404744 - -0.39327756 6.338786 2.688887 8.7104025 17.469526 - -8.77959 7.0576906 4.648855 -1.3089896 -23.294737 - 8.013747 13.891729 -9.926753 5.655307 -5.9422326 - -22.842539 0.6293588 -18.46266 -10.811862 9.8192625 - 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 - 1.7594414 -0.6485091 4.485623 2.0207152 7.264915 - -6.40137 23.63524 2.9711294 -22.708025 9.93719 - 20.354511 -10.324688 -0.700492 -8.783211 -5.27593 - 15.999649 3.3004563 12.747926 15.429879 4.7849145 - 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 - 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 - -9.224193 14.568347 -10.568833 4.982321 -4.342062 - 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 - -6.680575 0.4757669 -5.035051 -6.7964664 16.865469 - -11.54324 7.681869 0.44475392 9.708182 -8.932846 - 0.4123232 -4.361452 1.3948607 9.511665 0.11667654 - 2.9079323 6.049952 9.275183 -18.078873 6.2983274 - -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 - 4.010979 11.000591 -2.8873312 7.1352735 -16.79663 - 18.495346 -14.293832 7.89578 2.2714825 22.976387 - -4.875734 -3.0836344 -2.9999814 13.751918 6.448228 - -11.924197 2.171869 2.0423572 -6.173772 10.778437 - 25.77281 -4.9495463 14.57806 0.3044315 2.6132357 - -7.591999 -2.076944 9.025118 1.7834753 -3.1799617 - -4.9401326 23.465864 5.1685796 -9.018578 9.037825 - -4.4150195 6.859591 -12.274467 -0.88911164 5.186309 - -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 - -12.397416 -12.719869 
-1.395601 2.1150916 5.7381287 - -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 - 8.731719 -20.778936 -11.495662 5.8033476 -4.752041 - 10.833007 -6.717991 4.504732 13.4244375 1.1306485 - 7.3435574 1.400918 14.704036 -9.501399 7.2315617 - -6.417456 1.3333273 11.872697 -0.30664724 8.8845 - 6.5569253 4.7948146 0.03662816 -8.704245 6.224871 - -3.2701402 -11.508579 ] + demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055 + 1.756596 5.167894 10.80636 -3.8226728 -5.6141334 + 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897 + -9.723131 0.6619743 -6.976803 10.213478 7.494748 + 2.9105635 3.8949256 3.7999806 7.1061673 16.905321 + -7.1493764 8.733103 3.4230042 -4.831653 -11.403367 + 11.232214 7.1274667 -4.2828417 2.452362 -5.130748 + -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683 + 0.7618269 1.1253023 -2.083836 4.725744 -8.782597 + -3.539873 3.814236 5.1420674 2.162061 4.096431 + -6.4162116 12.747448 1.9429878 -15.152943 6.417416 + 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939 + 11.567354 3.69788 11.258265 7.442363 9.183411 + 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783 + 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073 + -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116 + 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578 + -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651 + -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556 + -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704 + 3.272176 2.8382776 5.134597 -9.190781 -0.5657382 + -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576 + -0.31784213 9.493548 2.1144536 4.358092 -12.089823 + 8.451689 -7.925461 4.6242585 4.4289427 18.692003 + -2.6204622 -5.149185 -0.35821092 8.488551 4.981496 + -9.32683 -2.2544234 6.6417594 1.2119585 10.977129 + 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716 + -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796 + 0.66607 15.443222 4.740594 -3.4725387 11.592567 + -2.054497 1.7361217 -8.265324 -9.30447 5.4068313 + -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733 + -8.649895 -9.998958 -2.564841 -0.53999114 2.601808 + -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085 + 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456 + 7.3629923 0.4657332 3.132599 12.438889 -1.8337058 + 4.532936 2.7264361 10.145339 -6.521951 2.897153 + -3.3925855 5.079156 7.759716 4.677565 5.8457737 + 2.402413 7.7071047 3.9711342 -6.390043 6.1268735 + -3.7760346 -11.118123 ] ``` - Python API @@ -98,53 +104,109 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav config=None, # Set `config` and `ckpt_path` to None to use pretrained model. ckpt_path=None, audio_file='./85236145389.wav', - force_yes=False, device=paddle.get_device()) print('Audio embedding Result: \n{}'.format(audio_emb)) + + test_emb = vector_executor( + model='ecapatdnn_voxceleb12', + sample_rate=16000, + config=None, # Set `config` and `ckpt_path` to None to use pretrained model. 
+ ckpt_path=None, + audio_file='./123456789.wav', + device=paddle.get_device()) + print('Test embedding Result: \n{}'.format(test_emb)) + + # score range [0, 1] + score = vector_executor.get_embeddings_score(audio_emb, test_emb) + print(f"Embeddings Score: {score}") ``` 输出: ```bash # Vector Result: - [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 - -3.04878 1.611095 10.127234 -10.534177 -15.821609 - 1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 - -11.343508 2.3385992 -8.719341 14.213509 15.404744 - -0.39327756 6.338786 2.688887 8.7104025 17.469526 - -8.77959 7.0576906 4.648855 -1.3089896 -23.294737 - 8.013747 13.891729 -9.926753 5.655307 -5.9422326 - -22.842539 0.6293588 -18.46266 -10.811862 9.8192625 - 3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 - 1.7594414 -0.6485091 4.485623 2.0207152 7.264915 - -6.40137 23.63524 2.9711294 -22.708025 9.93719 - 20.354511 -10.324688 -0.700492 -8.783211 -5.27593 - 15.999649 3.3004563 12.747926 15.429879 4.7849145 - 5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 - 15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 - -9.224193 14.568347 -10.568833 4.982321 -4.342062 - 0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 - -6.680575 0.4757669 -5.035051 -6.7964664 16.865469 - -11.54324 7.681869 0.44475392 9.708182 -8.932846 - 0.4123232 -4.361452 1.3948607 9.511665 0.11667654 - 2.9079323 6.049952 9.275183 -18.078873 6.2983274 - -0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 - 4.010979 11.000591 -2.8873312 7.1352735 -16.79663 - 18.495346 -14.293832 7.89578 2.2714825 22.976387 - -4.875734 -3.0836344 -2.9999814 13.751918 6.448228 - -11.924197 2.171869 2.0423572 -6.173772 10.778437 - 25.77281 -4.9495463 14.57806 0.3044315 2.6132357 - -7.591999 -2.076944 9.025118 1.7834753 -3.1799617 - -4.9401326 23.465864 5.1685796 -9.018578 9.037825 - -4.4150195 6.859591 -12.274467 -0.88911164 5.186309 - -3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 - -12.397416 -12.719869 -1.395601 2.1150916 5.7381287 - -4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 - 8.731719 -20.778936 -11.495662 5.8033476 -4.752041 - 10.833007 -6.717991 4.504732 13.4244375 1.1306485 - 7.3435574 1.400918 14.704036 -9.501399 7.2315617 - -6.417456 1.3333273 11.872697 -0.30664724 8.8845 - 6.5569253 4.7948146 0.03662816 -8.704245 6.224871 - -3.2701402 -11.508579 ] + Audio embedding Result: + [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055 + 1.756596 5.167894 10.80636 -3.8226728 -5.6141334 + 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897 + -9.723131 0.6619743 -6.976803 10.213478 7.494748 + 2.9105635 3.8949256 3.7999806 7.1061673 16.905321 + -7.1493764 8.733103 3.4230042 -4.831653 -11.403367 + 11.232214 7.1274667 -4.2828417 2.452362 -5.130748 + -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683 + 0.7618269 1.1253023 -2.083836 4.725744 -8.782597 + -3.539873 3.814236 5.1420674 2.162061 4.096431 + -6.4162116 12.747448 1.9429878 -15.152943 6.417416 + 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939 + 11.567354 3.69788 11.258265 7.442363 9.183411 + 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783 + 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073 + -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116 + 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578 + -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651 + -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556 + -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704 + 3.272176 2.8382776 5.134597 -9.190781 -0.5657382 + -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576 + -0.31784213 9.493548 2.1144536 
4.358092 -12.089823 + 8.451689 -7.925461 4.6242585 4.4289427 18.692003 + -2.6204622 -5.149185 -0.35821092 8.488551 4.981496 + -9.32683 -2.2544234 6.6417594 1.2119585 10.977129 + 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716 + -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796 + 0.66607 15.443222 4.740594 -3.4725387 11.592567 + -2.054497 1.7361217 -8.265324 -9.30447 5.4068313 + -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733 + -8.649895 -9.998958 -2.564841 -0.53999114 2.601808 + -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085 + 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456 + 7.3629923 0.4657332 3.132599 12.438889 -1.8337058 + 4.532936 2.7264361 10.145339 -6.521951 2.897153 + -3.3925855 5.079156 7.759716 4.677565 5.8457737 + 2.402413 7.7071047 3.9711342 -6.390043 6.1268735 + -3.7760346 -11.118123 ] + # get the test embedding + Test embedding Result: + [ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125 + 6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845 + -8.04332 4.344489 2.3200977 -14.306299 5.184692 + -11.55602 -3.8497238 0.6444722 1.2833948 2.6766639 + 0.5878921 0.7946299 1.7207596 2.5791872 14.998469 + -1.3385371 15.031221 -0.8006958 1.99287 -9.52007 + 2.435466 4.003221 -4.33817 -4.898601 -5.304714 + -18.033886 10.790787 -12.784645 -5.641755 2.9761686 + -10.566622 1.4839455 6.152458 -5.7195854 2.8603241 + 6.112133 8.489869 5.5958056 1.2836679 -1.2293907 + 0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906 + 14.905906 -5.025907 0.7866458 -4.2444224 -16.354029 + 10.521315 0.9604709 -3.3257897 7.144871 -13.592733 + -8.568869 -1.7953678 0.26313916 10.916714 -6.9374123 + 1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357 + -0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422 + 3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846 + 1.3420814 4.240421 -2.772944 -2.8451524 16.311104 + 4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239 + -1.5708797 1.568961 1.1413603 3.5032008 -0.45251232 + -6.786333 16.89443 5.3366146 -8.789056 0.6355629 + 3.2579517 -3.328322 7.5969577 0.66025066 -6.550468 + -9.148656 2.020372 -0.4615173 1.1965656 -3.8764873 + 11.6562195 -6.0750933 12.182899 3.2218833 0.81969476 + 5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166 + -5.249467 -2.2671914 7.2658715 -13.298164 4.821147 + -2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838 + -3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094 + -3.4903061 2.2408795 5.5010734 -3.970756 11.99696 + -7.8858757 0.43160373 -5.5059714 4.3426995 16.322706 + 11.635366 0.72157705 -9.245714 -3.91465 -4.449838 + -1.5716927 7.713747 -2.2430465 -6.198303 -13.481864 + 2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571 + 13.270688 3.448231 -7.0659585 4.5886116 -4.466099 + -0.296428 -11.463529 -2.6076477 14.110243 -6.9725137 + -1.9962958 2.7119343 19.391657 0.01961198 14.607133 + -1.6695905 -4.391516 1.3131028 -6.670972 -5.888604 + 12.0612335 5.9285784 3.3715196 1.492534 10.723728 + -0.95514804 -12.085431 ] + # get the score between enroll and test + Embeddings Score: 0.4292638301849365 ``` ### 4.预训练模型 diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh index 856886d333cd30f983576875e809ed2016a51f50..6140f7f38111978d464c58cafd35aa9c4c0a7cb7 100644 --- a/demos/speaker_verification/run.sh +++ b/demos/speaker_verification/run.sh @@ -1,6 +1,9 @@ #!/bin/bash wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav -# asr -paddlespeech vector --task spk --input ./85236145389.wav \ No newline at end of file 
+# vector +paddlespeech vector --task spk --input ./85236145389.wav + +paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav" diff --git a/docs/source/released_model.md b/docs/source/released_model.md index 9a423e03ecf685dc853119be2c69b9219ea1536a..1cbe398956797a1f54f2f210cee7dc04af35e3aa 100644 --- a/docs/source/released_model.md +++ b/docs/source/released_model.md @@ -6,7 +6,7 @@ ### Speech Recognition Model Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: -[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) +[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) @@ -37,8 +37,8 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)||| Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB| TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)||| -SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) 
|[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)|12MB| -FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)|157MB| +SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB| +FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB| FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)||| FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)||| FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)||| @@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https Model Type | Dataset| Example Link | Pretrained Models | Static Models :-------------:| :------------:| :-----: | :-----: | :-----: -PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | - +ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | - ## Punctuation Restoration Models Model Type | Dataset| Example Link | Pretrained Models diff --git a/examples/aishell/asr0/README.md b/examples/aishell/asr0/README.md index bb45d8df0423ccb3f56a81229b25cf5e459696a8..16489992d93b088862368da45a7bee675246870b 100644 --- a/examples/aishell/asr0/README.md +++ b/examples/aishell/asr0/README.md @@ -173,12 +173,7 @@ bash local/data.sh 
--stage 2 --stop_stage 2 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 ``` -The performance of the released models are shown below: - -| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | -| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- | -| Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | - | 151 h | -| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h | +The performance of the released models is shown in [RESULTS.md](./RESULTS.md). ## Stage 4: Static graph model Export This stage is to transform dygraph to static graph. ```bash diff --git a/examples/aishell/asr0/RESULTS.md b/examples/aishell/asr0/RESULTS.md index 5841a85220da92dd55ffb980e7e24ac398574709..8af3d66d17efa8fa62a276c67618cd45b03fd77d 100644 --- a/examples/aishell/asr0/RESULTS.md +++ b/examples/aishell/asr0/RESULTS.md @@ -4,15 +4,16 @@ | Model | Number of Params | Release | Config | Test set | Valid Loss | CER | | --- | --- | --- | --- | --- | --- | --- | -| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 | +| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 | +| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 | ## Deepspeech2 Non-Streaming | Model | Number of Params | Release | Config | Test set | Valid Loss | CER | | --- | --- | --- | --- | --- | --- | --- | -| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 | -| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 | -| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 | -| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 | +| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 | +| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 | +| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 | +| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 | | --- | --- | --- | --- | --- | --- | --- | -| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 | +| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 | diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md index 664ec1ac36349b64475d71b8f5ee4d916b0e47b4..925663ab1aefce9d6db6f4908d011cab4f77db79 100644 --- a/examples/aishell3/vc0/README.md +++ b/examples/aishell3/vc0/README.md @@ -118,7 +118,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu ``` ## Pretrained Model -[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip) +- [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip) Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss diff --git a/examples/aishell3/vc1/README.md 
b/examples/aishell3/vc1/README.md index 04b83a5ffae71082da84958e8d20e982ffeb396f..8ab0f9c8cff833fdaaebdc408ef1c5841381e7ce 100644 --- a/examples/aishell3/vc1/README.md +++ b/examples/aishell3/vc1/README.md @@ -119,7 +119,7 @@ ref_audio CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir} ``` ## Pretrained Model -[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip) +- [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip) Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md index dad464092d197de0065a3876ccc3605719de75ff..eb30e7c403c30dfeb1d466f558818eabda8dabfb 100644 --- a/examples/aishell3/voc1/README.md +++ b/examples/aishell3/voc1/README.md @@ -137,7 +137,8 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Models -Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip). +Pretrained models can be downloaded here: +- [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/aishell3/voc5/README.md b/examples/aishell3/voc5/README.md index ebe2530beec58e87fa55f8ce6a203260858a21f8..c957c4a3aab385cd94adf03fc2cf12afd5bb351e 100644 --- a/examples/aishell3/voc5/README.md +++ b/examples/aishell3/voc5/README.md @@ -136,7 +136,8 @@ optional arguments: 4. `--output-dir` is the directory to save the synthesized audio files. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Models -The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip). +The pretrained model can be downloaded here: +- [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss diff --git a/examples/csmsc/tts0/README.md b/examples/csmsc/tts0/README.md index 0129329aebd95010fdc6045be2151f0f6af8ea25..01376bd61e08055b6da9e71b4cfb812b8e35c5c9 100644 --- a/examples/csmsc/tts0/README.md +++ b/examples/csmsc/tts0/README.md @@ -212,7 +212,8 @@ optional arguments: Pretrained Tacotron2 model with no silence in the edge of audios: - [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip) -The static model can be downloaded here [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip). 
+The static model can be downloaded here: +- [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip) Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md index 5f31f7b369429a50a1a83185cf14eb687a88861a..4fbe34cbf739314caa468ea1acea5139fc4bc131 100644 --- a/examples/csmsc/tts2/README.md +++ b/examples/csmsc/tts2/README.md @@ -221,9 +221,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} ``` ## Pretrained Model -Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip). +Pretrained SpeedySpeech model with no silence in the edge of audios: +- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip) -The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip). +The static model can be downloaded here: +- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip) +- [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip) Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss :-------------:| :------------:| :-----: | :-----: | :--------:|:--------: diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md index ae8f7af607253861f96e5c59ac23f8e7c0d69c0e..bc672f66f1eea154436323fcef236456c3cd17b5 100644 --- a/examples/csmsc/tts3/README.md +++ b/examples/csmsc/tts3/README.md @@ -232,6 +232,9 @@ The static model can be downloaded here: - [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip) - [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip) +The ONNX model can be downloaded here: +- [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip) + Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287| diff --git a/examples/csmsc/tts3/local/ort_predict.sh b/examples/csmsc/tts3/local/ort_predict.sh new file mode 100755 index 0000000000000000000000000000000000000000..3154f6e5abcd9baa213b2fc41c24adc6c6b453ed --- /dev/null +++ b/examples/csmsc/tts3/local/ort_predict.sh @@ -0,0 +1,31 @@ +train_output_path=$1 + +stage=0 +stop_stage=0 + +# only default_fastspeech2 + hifigan/mb_melgan are supported for now! 
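+# usage: ./local/ort_predict.sh <train_output_path> (invoked from run.sh stage 6)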
+ +# synthesize from metadata +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../ort_predict.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_csmsc \ + --voc=hifigan_csmsc \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/onnx_infer_out \ + --device=cpu \ + --cpu_threads=2 +fi + +# e2e, synthesize from text +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_csmsc \ + --voc=hifigan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 +fi diff --git a/examples/csmsc/tts3/local/paddle2onnx.sh b/examples/csmsc/tts3/local/paddle2onnx.sh new file mode 100755 index 0000000000000000000000000000000000000000..505f3b663622063df9f40a6e7d13a6d8553a5025 --- /dev/null +++ b/examples/csmsc/tts3/local/paddle2onnx.sh @@ -0,0 +1,22 @@ +train_output_path=$1 +model_dir=$2 +output_dir=$3 +model=$4 + +enable_dev_version=True + +model_name=${model%_*} +echo model_name: ${model_name} + +if [ ${model_name} = 'mb_melgan' ] ;then + enable_dev_version=False +fi + +mkdir -p ${train_output_path}/${output_dir} + +paddle2onnx \ + --model_dir ${train_output_path}/${model_dir} \ + --model_filename ${model}.pdmodel \ + --params_filename ${model}.pdiparams \ + --save_file ${train_output_path}/${output_dir}/${model}.onnx \ + --enable_dev_version ${enable_dev_version} \ No newline at end of file diff --git a/examples/csmsc/tts3/run.sh b/examples/csmsc/tts3/run.sh index e1a149b6524716dbf68c8b898cd8d8e5b22e57f6..b617d53527d16abd536a65226c93e7ae24592bc3 100755 --- a/examples/csmsc/tts3/run.sh +++ b/examples/csmsc/tts3/run.sh @@ -41,3 +41,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 fi +# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first +# we have only tested the following models so far +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # install paddle2onnx + version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') + if [[ -z "$version" || ${version} != '0.9.4' ]]; then + pip install paddle2onnx==0.9.4 + fi + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc +fi + +# inference with onnxruntime, use fastspeech2 + hifigan by default +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + # install onnxruntime + version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}') + if [[ -z "$version" || ${version} != '1.10.0' ]]; then + pip install onnxruntime==1.10.0 + fi + ./local/ort_predict.sh ${train_output_path} +fi diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md index 5527e80888c12456c2e3dbd13cad8e79a6e93da4..2d6de168a18166bb12fff7c808607644d005ad96 100644 --- a/examples/csmsc/voc1/README.md +++ b/examples/csmsc/voc1/README.md @@ -127,9 +127,11 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 
## Pretrained Models -The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip). +The pretrained model can be downloaded here: +- [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) -The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip). +The static model can be downloaded here: +- [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip) Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/csmsc/voc3/README.md b/examples/csmsc/voc3/README.md index 22104a8f215f2c1eca29889778b98ac08575e193..12adaf7f4e2098f86e75c4d155951bccc8969f5e 100644 --- a/examples/csmsc/voc3/README.md +++ b/examples/csmsc/voc3/README.md @@ -152,11 +152,17 @@ TODO: The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set). ## Pretrained Models -The pretrained model can be downloaded here [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip). +The pretrained model can be downloaded here: +- [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) -The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip). +The finetuned model can be downloaded here: +- [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip) -The static model can be downloaded here [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) +The static model can be downloaded here: +- [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) + +The ONNX model can be downloaded here: +- [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip) Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss :-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------: diff --git a/examples/csmsc/voc4/README.md b/examples/csmsc/voc4/README.md index b5c6873917602350d4e244fdcfd496d74e6192da..b7add3e574c63b61c22e75ca15289f2b6bc7ce51 100644 --- a/examples/csmsc/voc4/README.md +++ b/examples/csmsc/voc4/README.md @@ -112,7 +112,8 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Models -The pretrained model can be downloaded here [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip). 
+The pretrained model can be downloaded here: +- [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip) The static model of Style MelGAN is not available now. diff --git a/examples/csmsc/voc5/README.md b/examples/csmsc/voc5/README.md index 21afe6eefad51678568b0b7ab961ceffe33ac779..33e676165a3a2c65c1510d93a55f642bbec92116 100644 --- a/examples/csmsc/voc5/README.md +++ b/examples/csmsc/voc5/README.md @@ -112,9 +112,14 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Models -The pretrained model can be downloaded here [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip). +The pretrained model can be downloaded here: +- [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip) -The static model can be downloaded here [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip). +The static model can be downloaded here: +- [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip) + +The ONNX model can be downloaded here: +- [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip) Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/csmsc/voc6/README.md b/examples/csmsc/voc6/README.md index 7763b3551422d602f8b7321bb8a0a80b4a0f74e9..26d4523d91c9f9c7036e0ae4ae2053e0dc9402ef 100644 --- a/examples/csmsc/voc6/README.md +++ b/examples/csmsc/voc6/README.md @@ -109,9 +109,11 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Models -The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip). +The pretrained model can be downloaded here: +- [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip) -The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip). +The static model can be downloaded here: +- [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip) Model | Step | eval/loss :-------------:|:------------:| :------------: diff --git a/examples/iwslt2012/punc0/README.md b/examples/iwslt2012/punc0/README.md index 74d599a21d9cc392abdbae60fcda81a65ff2e01d..6caa9710b914b50814c027d5af1e1803bc6de113 100644 --- a/examples/iwslt2012/punc0/README.md +++ b/examples/iwslt2012/punc0/README.md @@ -21,7 +21,7 @@ The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip). 
### Test Result -- Ernie Linear +- Ernie | |COMMA | PERIOD | QUESTION | OVERALL| |:-----:|:-----:|:-----:|:-----:|:-----:| |Precision |0.510955 |0.526462 |0.820755 |0.619391| diff --git a/examples/iwslt2012/punc0/RESULTS.md b/examples/iwslt2012/punc0/RESULTS.md new file mode 100644 index 0000000000000000000000000000000000000000..2e22713d858a9c0f89478857a756a1d8877ff8bd --- /dev/null +++ b/examples/iwslt2012/punc0/RESULTS.md @@ -0,0 +1,9 @@ +# iwslt2012 + +## Ernie + +| |COMMA | PERIOD | QUESTION | OVERALL| +|:-----:|:-----:|:-----:|:-----:|:-----:| +|Precision |0.510955 |0.526462 |0.820755 |0.619391| +|Recall |0.517433 |0.564179 |0.861386 |0.647666| +|F1 |0.514173 |0.544669 |0.840580 |0.633141| diff --git a/examples/ljspeech/tts1/README.md b/examples/ljspeech/tts1/README.md index 4f7680e8456d077a6641d3de6f5d54a7c3dd37cf..7f32522acd3f486a958d4e8640ee88275e7fbb8b 100644 --- a/examples/ljspeech/tts1/README.md +++ b/examples/ljspeech/tts1/README.md @@ -171,7 +171,8 @@ optional arguments: 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -Pretrained Model can be downloaded here. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip) +Pretrained Model can be downloaded here: +- [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip) TransformerTTS checkpoint contains files listed below. ```text diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index f5e919c0fe45bff2fe66a599e8bf1c4030a0c91d..e028fa05d5a1748fab1a4fc3231f6da741701e76 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -214,7 +214,8 @@ optional arguments: 9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) +Pretrained FastSpeech2 model with no silence in the edge of audios: +- [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: diff --git a/examples/ljspeech/voc0/README.md b/examples/ljspeech/voc0/README.md index 13a50efb54049f6a78c24cf90fc66f00ad2ef0df..41b08d57f42b4d81327a6d74e5b9abd394c19603 100644 --- a/examples/ljspeech/voc0/README.md +++ b/examples/ljspeech/voc0/README.md @@ -50,4 +50,5 @@ Synthesize waveform. 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -Pretrained Model with residual channel equals 128 can be downloaded here. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip). 
+Pretrained Model with 128 residual channels can be downloaded here: +- [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip) diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index 6fcb2a520d8167daaea24d7574bd3fa809a6090f..4513b2a05a67342a9be8d923fc517a7738eccf83 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -127,7 +127,8 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) +Pretrained models can be downloaded here: +- [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) Parallel WaveGAN checkpoint contains files listed below. diff --git a/examples/ljspeech/voc5/README.md b/examples/ljspeech/voc5/README.md index 9fbb9f74615bd9eef4e54f93542b3a836e9fb000..9b31e2650459c54ee1d6bed286f4f361077331ff 100644 --- a/examples/ljspeech/voc5/README.md +++ b/examples/ljspeech/voc5/README.md @@ -127,7 +127,8 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -The pretrained model can be downloaded here [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip). +The pretrained model can be downloaded here: +- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip) Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss @@ -143,6 +144,5 @@ hifigan_ljspeech_ckpt_0.2.0 └── snapshot_iter_2500000.pdz # generator parameters of hifigan ``` - ## Acknowledgement We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN. diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index 157949d1fd24c3d3163ec133819dc3a32a09f7dd..f373ca6a387e53b8395570705f6e8576293055c0 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -217,7 +217,8 @@ optional arguments: 9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip) +Pretrained FastSpeech2 model with no silence at the edges of audios: +- [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip) FastSpeech2 checkpoint contains files listed below. ```text diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index 4714f28dc2c7e7aa06857349a24ec6ac2d0423f6..1c3016f885de2bb15bd2af5d4866782cb0a81f80 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -132,7 +132,8 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -Pretrained models can be downloaded here [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip).
+Pretrained models can be downloaded here: +- [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip) Parallel WaveGAN checkpoint contains files listed below. diff --git a/examples/vctk/voc5/README.md b/examples/vctk/voc5/README.md index b4be341c0e52a74bbf475e22d39fd72bfe737e47..4eb25c02d7f97764d10ab2c6b5e871f06b61b148 100644 --- a/examples/vctk/voc5/README.md +++ b/examples/vctk/voc5/README.md @@ -133,7 +133,8 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model -The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip). +The pretrained model can be downloaded here: +- [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss diff --git a/examples/voxceleb/sv0/RESULT.md b/examples/voxceleb/sv0/RESULT.md index c37bcecef9b4276adcd7eb05b14893c48c3bdf96..3a3f67d09655cdc62daef128ca4822c6bb05163c 100644 --- a/examples/voxceleb/sv0/RESULT.md +++ b/examples/voxceleb/sv0/RESULT.md @@ -4,4 +4,4 @@ | Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm | | --- | --- | --- | --- | --- | --- | --- | ---- | -| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 | +| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 | 0.95 | diff --git a/paddleaudio/paddleaudio/utils/numeric.py b/paddleaudio/paddleaudio/utils/numeric.py new file mode 100644 index 0000000000000000000000000000000000000000..126cada503f83e9d412d8c83c5728c42cf19c52b --- /dev/null +++ b/paddleaudio/paddleaudio/utils/numeric.py @@ -0,0 +1,30 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def pcm16to32(audio: np.ndarray) -> np.ndarray: + """pcm int16 to float32 + + Args: + audio (np.ndarray): Waveform with dtype of int16. + + Returns: + np.ndarray: Waveform with dtype of float32. 
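+ + Example (illustrative doctest; 16384 / 2**15 == 0.5): + >>> pcm16to32(np.array([16384], dtype=np.int16)) + array([0.5], dtype=float32)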
+ """ + if audio.dtype == np.int16: + audio = audio.astype("float32") + bits = np.iinfo(np.int16).bits + audio = audio / (2**(bits - 1)) + return audio diff --git a/paddlespeech/cli/asr/infer.py b/paddlespeech/cli/asr/infer.py index 1fb4be43486fbe896b97d6d6a3ac766c53f208e1..b12b9f6fce89a44564ed66a4346b10032100a4af 100644 --- a/paddlespeech/cli/asr/infer.py +++ b/paddlespeech/cli/asr/infer.py @@ -80,9 +80,9 @@ pretrained_models = { }, "deepspeech2online_aishell-zh-16k": { 'url': - 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz', + 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz', 'md5': - 'd5e076217cf60486519f72c217d21b9b', + '23e16c69730a1cb5d735c98c83c21e16', 'cfg_path': 'model.yaml', 'ckpt_path': @@ -426,6 +426,11 @@ class ASRExecutor(BaseExecutor): try: audio, audio_sample_rate = soundfile.read( audio_file, dtype="int16", always_2d=True) + audio_duration = audio.shape[0] / audio_sample_rate + max_duration = 50.0 + if audio_duration >= max_duration: + logger.error("Please input audio file less then 50 seconds.\n") + return except Exception as e: logger.exception(e) logger.error( diff --git a/paddlespeech/cli/vector/infer.py b/paddlespeech/cli/vector/infer.py index 175a9723e1bd811be97de5995996d79f0ef19307..68e832ac74d4dda805a4185ab09a72f2eb7d6413 100644 --- a/paddlespeech/cli/vector/infer.py +++ b/paddlespeech/cli/vector/infer.py @@ -15,6 +15,7 @@ import argparse import os import sys from collections import OrderedDict +from typing import Dict from typing import List from typing import Optional from typing import Union @@ -42,9 +43,9 @@ pretrained_models = { # "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav" "ecapatdnn_voxceleb12-16k": { 'url': - 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz', + 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz', 'md5': - 'a1c0dba7d4de997187786ff517d5b4ec', + 'cc33023c54ab346cd318408f43fcaf95', 'cfg_path': 'conf/model.yaml', # the yaml config path 'ckpt_path': @@ -79,7 +80,7 @@ class VectorExecutor(BaseExecutor): "--task", type=str, default="spk", - choices=["spk"], + choices=["spk", "score"], help="task type in vector domain") self.parser.add_argument( "--input", @@ -147,13 +148,40 @@ class VectorExecutor(BaseExecutor): logger.info(f"task source: {task_source}") # stage 3: process the audio one by one + # we do action according the task type task_result = OrderedDict() has_exceptions = False for id_, input_ in task_source.items(): try: - res = self(input_, model, sample_rate, config, ckpt_path, - device) - task_result[id_] = res + # extract the speaker audio embedding + if parser_args.task == "spk": + logger.info("do vector spk task") + res = self(input_, model, sample_rate, config, ckpt_path, + device) + task_result[id_] = res + elif parser_args.task == "score": + logger.info("do vector score task") + logger.info(f"input content {input_}") + if len(input_.split()) != 2: + logger.error( + f"vector score task input {input_} wav num is not two," + "that is {len(input_.split())}") + sys.exit(-1) + + # get the enroll and test embedding + enroll_audio, test_audio = input_.split() + logger.info( + f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}" + ) + enroll_embedding = self(enroll_audio, model, sample_rate, + config, ckpt_path, device) + test_embedding = self(test_audio, 
model, sample_rate, + config, ckpt_path, device) + + # get the score + res = self.get_embeddings_score(enroll_embedding, + test_embedding) + task_result[id_] = res except Exception as e: has_exceptions = True task_result[id_] = f'{e.__class__.__name__}: {e}' @@ -172,6 +200,49 @@ class VectorExecutor(BaseExecutor): else: return True + def _get_job_contents( + self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]: + """ + Read a job input file and return its contents in a dictionary. + Refactored from Executor._get_job_contents. + + Args: + job_input (os.PathLike): The job input file. + + Returns: + Dict[str, str]: Contents of job input. + """ + job_contents = OrderedDict() + with open(job_input) as f: + for line in f: + line = line.strip() + if not line: + continue + k = line.split(' ')[0] + v = ' '.join(line.split(' ')[1:]) + job_contents[k] = v + return job_contents + + def get_embeddings_score(self, enroll_embedding, test_embedding): + """Compute the cosine similarity score between the enroll and test embeddings + + Args: + enroll_embedding (numpy.array): shape: (emb_size), enroll audio embedding + test_embedding (numpy.array): shape: (emb_size), test audio embedding + + Returns: + score: the cosine similarity score between the enroll and test embeddings + """ + if not hasattr(self, "score_func"): + self.score_func = paddle.nn.CosineSimilarity(axis=0) + logger.info("create the cosine score function") + + score = self.score_func( + paddle.to_tensor(enroll_embedding), + paddle.to_tensor(test_embedding)) + + return score.item() + @stats_wrapper def __call__(self, audio_file: os.PathLike, diff --git a/paddlespeech/server/engine/asr/online/asr_engine.py b/paddlespeech/server/engine/asr/online/asr_engine.py index 389175a0a0c257903f8b4c296842923bf1b73cf7..9029aa6e9e45a24f06c6a806bff0c82dd1e84d95 100644 --- a/paddlespeech/server/engine/asr/online/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/asr_engine.py @@ -36,7 +36,7 @@ pretrained_models = { 'url': 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz', 'md5': - 'd5e076217cf60486519f72c217d21b9b', + '23e16c69730a1cb5d735c98c83c21e16', 'cfg_path': 'model.yaml', 'ckpt_path': diff --git a/paddlespeech/t2s/exps/fastspeech2/preprocess.py b/paddlespeech/t2s/exps/fastspeech2/preprocess.py index 5bda75451b071321e681adceb598f29162b5fb8c..db1842b2e89fe3044e96ca4babb07c1796d06da3 100644 --- a/paddlespeech/t2s/exps/fastspeech2/preprocess.py +++ b/paddlespeech/t2s/exps/fastspeech2/preprocess.py @@ -86,6 +86,9 @@ def process_sentence(config: Dict[str, Any], logmel = mel_extractor.get_log_mel_fbank(wav) # change duration according to mel_length compare_duration_and_mel_length(sentences, utt_id, logmel) + # utt_id may be popped in compare_duration_and_mel_length + if utt_id not in sentences: + return None phones = sentences[utt_id][0] durations = sentences[utt_id][1] num_frames = logmel.shape[0] diff --git a/paddlespeech/t2s/exps/inference.py b/paddlespeech/t2s/exps/inference.py index 1188ddfb132151e095ba0541ecb0fce12ad7e4ab..62602a01f28c4365c80ba6fb01b98cef2572a579 100644 --- a/paddlespeech/t2s/exps/inference.py +++ b/paddlespeech/t2s/exps/inference.py @@ -104,7 +104,7 @@ def get_voc_output(args, voc_predictor, input): def parse_args(): parser = argparse.ArgumentParser( - description="Paddle Infernce with speedyspeech & parallel wavegan.") + description="Paddle Inference with acoustic model & vocoder.") # acoustic model parser.add_argument( '--am', diff --git 
b/paddlespeech/t2s/exps/ort_predict.py new file mode 100644 index 0000000000000000000000000000000000000000..e8d4d61c32e09983e66346a2b9f6a26a7c269846 --- /dev/null +++ b/paddlespeech/t2s/exps/ort_predict.py @@ -0,0 +1,156 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from pathlib import Path + +import jsonlines +import numpy as np +import onnxruntime as ort +import soundfile as sf +from timer import timer + +from paddlespeech.t2s.exps.syn_utils import get_test_dataset +from paddlespeech.t2s.utils import str2bool + + +def get_sess(args, field='am'): + full_name = '' + if field == 'am': + full_name = args.am + elif field == 'voc': + full_name = args.voc + model_dir = str(Path(args.inference_dir) / (full_name + ".onnx")) + sess_options = ort.SessionOptions() + sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL + sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL + + if args.device == "gpu": + # fastspeech2/mb_melgan can't use trt now! + if args.use_trt: + providers = ['TensorrtExecutionProvider'] + else: + providers = ['CUDAExecutionProvider'] + elif args.device == "cpu": + providers = ['CPUExecutionProvider'] + sess_options.intra_op_num_threads = args.cpu_threads + sess = ort.InferenceSession( + model_dir, providers=providers, sess_options=sess_options) + return sess + + +def ort_predict(args): + # construct dataset for evaluation + with jsonlines.open(args.test_metadata, 'r') as reader: + test_metadata = list(reader) + am_name = args.am[:args.am.rindex('_')] + am_dataset = args.am[args.am.rindex('_') + 1:] + test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset) + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + fs = 24000 if am_dataset != 'ljspeech' else 22050 + + # am + am_sess = get_sess(args, field='am') + + # vocoder + voc_sess = get_sess(args, field='voc') + + # am warmup + for T in [27, 38, 54]: + data = np.random.randint(1, 266, size=(T, )) + am_sess.run(None, {"text": data}) + + # voc warmup + for T in [227, 308, 544]: + data = np.random.rand(T, 80).astype("float32") + voc_sess.run(None, {"logmel": data}) + print("warm up done!") + + N = 0 + T = 0 + for example in test_dataset: + utt_id = example['utt_id'] + phone_ids = example["text"] + with timer() as t: + mel = am_sess.run(output_names=None, input_feed={'text': phone_ids}) + mel = mel[0] + wav = voc_sess.run(output_names=None, input_feed={'logmel': mel}) + + N += len(wav[0]) + T += t.elapse + speed = len(wav[0]) / t.elapse + rtf = fs / speed + sf.write( + str(output_dir / (utt_id + ".wav")), + np.array(wav)[0], + samplerate=fs) + print( + f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+ ) + print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T)}") + + +def parse_args(): + parser = argparse.ArgumentParser(description="Inference with onnxruntime.") + # acoustic model + parser.add_argument( + '--am', + type=str, + default='fastspeech2_csmsc', + choices=[ + 'fastspeech2_csmsc', + ], + help='Choose acoustic model type of tts task.') + + # voc + parser.add_argument( + '--voc', + type=str, + default='hifigan_csmsc', + choices=['hifigan_csmsc', 'mb_melgan_csmsc'], + help='Choose vocoder type of tts task.') + # other + parser.add_argument( + "--inference_dir", type=str, help="dir to save inference models") + parser.add_argument("--test_metadata", type=str, help="test metadata.") + parser.add_argument("--output_dir", type=str, help="output dir") + + # inference + parser.add_argument( + "--use_trt", + type=str2bool, + default=False, + help="Whether to use the TensorRT inference engine.", ) + + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu"], + help="Device selected for inference.", ) + parser.add_argument('--cpu_threads', type=int, default=1) + + args, _ = parser.parse_known_args() + return args + + +def main(): + args = parse_args() + + ort_predict(args) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ort_predict_e2e.py b/paddlespeech/t2s/exps/ort_predict_e2e.py new file mode 100644 index 0000000000000000000000000000000000000000..8aa04cbc556ad36d7275c2b5ccad3cd9fa5b139b --- /dev/null +++ b/paddlespeech/t2s/exps/ort_predict_e2e.py @@ -0,0 +1,183 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from pathlib import Path + +import numpy as np +import onnxruntime as ort +import soundfile as sf +from timer import timer + +from paddlespeech.t2s.exps.syn_utils import get_frontend +from paddlespeech.t2s.exps.syn_utils import get_sentences +from paddlespeech.t2s.utils import str2bool + + +def get_sess(args, field='am'): + full_name = '' + if field == 'am': + full_name = args.am + elif field == 'voc': + full_name = args.voc + model_dir = str(Path(args.inference_dir) / (full_name + ".onnx")) + sess_options = ort.SessionOptions() + sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL + sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL + + if args.device == "gpu": + # fastspeech2/mb_melgan can't use trt now!
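+ # choose the execution provider list for the requested device (ONNX Runtime tries them in the listed order)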
+ if args.use_trt: + providers = ['TensorrtExecutionProvider'] + else: + providers = ['CUDAExecutionProvider'] + elif args.device == "cpu": + providers = ['CPUExecutionProvider'] + sess_options.intra_op_num_threads = args.cpu_threads + sess = ort.InferenceSession( + model_dir, providers=providers, sess_options=sess_options) + return sess + + +def ort_predict(args): + + # frontend + frontend = get_frontend(args) + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + sentences = get_sentences(args) + + am_name = args.am[:args.am.rindex('_')] + am_dataset = args.am[args.am.rindex('_') + 1:] + fs = 24000 if am_dataset != 'ljspeech' else 22050 + + # am + am_sess = get_sess(args, field='am') + + # vocoder + voc_sess = get_sess(args, field='voc') + + # am warmup + for T in [27, 38, 54]: + data = np.random.randint(1, 266, size=(T, )) + am_sess.run(None, {"text": data}) + + # voc warmup + for T in [227, 308, 544]: + data = np.random.rand(T, 80).astype("float32") + voc_sess.run(None, {"logmel": data}) + print("warm up done!") + + # frontend warmup + # Loading model cost 0.5+ seconds + if args.lang == 'zh': + frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True) + else: + print("lang should be 'zh' here!") + + N = 0 + T = 0 + merge_sentences = True + for utt_id, sentence in sentences: + with timer() as t: + if args.lang == 'zh': + input_ids = frontend.get_input_ids( + sentence, merge_sentences=merge_sentences) + + phone_ids = input_ids["phone_ids"] + else: + print("lang should be 'zh' here!") + # merge_sentences=True here, so we only use the first item of phone_ids + phone_ids = phone_ids[0].numpy() + mel = am_sess.run(output_names=None, input_feed={'text': phone_ids}) + mel = mel[0] + wav = voc_sess.run(output_names=None, input_feed={'logmel': mel}) + + N += len(wav[0]) + T += t.elapse + speed = len(wav[0]) / t.elapse + rtf = fs / speed + sf.write( + str(output_dir / (utt_id + ".wav")), + np.array(wav)[0], + samplerate=fs) + print( + f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}." + ) + print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T)}") + + +def parse_args(): + parser = argparse.ArgumentParser(description="Inference with onnxruntime.") + # acoustic model + parser.add_argument( + '--am', + type=str, + default='fastspeech2_csmsc', + choices=[ + 'fastspeech2_csmsc', + ], + help='Choose acoustic model type of tts task.') + parser.add_argument( + "--phones_dict", type=str, default=None, help="phone vocabulary file.") + parser.add_argument( + "--tones_dict", type=str, default=None, help="tone vocabulary file.") + + # voc + parser.add_argument( + '--voc', + type=str, + default='hifigan_csmsc', + choices=['hifigan_csmsc', 'mb_melgan_csmsc'], + help='Choose vocoder type of tts task.') + # other + parser.add_argument( + "--inference_dir", type=str, help="dir to save inference models") + parser.add_argument( + "--text", + type=str, + help="text to synthesize, a 'utt_id sentence' pair per line") + parser.add_argument("--output_dir", type=str, help="output dir") + parser.add_argument( + '--lang', + type=str, + default='zh', + help='Choose model language. 
zh or en') + + # inference + parser.add_argument( + "--use_trt", + type=str2bool, + default=False, + help="Whether to use the TensorRT inference engine.", ) + + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu"], + help="Device selected for inference.", ) + parser.add_argument('--cpu_threads', type=int, default=1) + + args, _ = parser.parse_known_args() + return args + + +def main(): + args = parse_args() + + ort_predict(args) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/speedyspeech/preprocess.py b/paddlespeech/t2s/exps/speedyspeech/preprocess.py index 3f81c4e14753d19e51db5d23f5e75440c67de34b..e833d13940530f293842a842b65f33cf6d03d9bd 100644 --- a/paddlespeech/t2s/exps/speedyspeech/preprocess.py +++ b/paddlespeech/t2s/exps/speedyspeech/preprocess.py @@ -79,6 +79,9 @@ def process_sentence(config: Dict[str, Any], logmel = mel_extractor.get_log_mel_fbank(wav) # change duration according to mel_length compare_duration_and_mel_length(sentences, utt_id, logmel) + # utt_id may be popped in compare_duration_and_mel_length + if utt_id not in sentences: + return None labels = sentences[utt_id][0] # extract phone and duration phones = [] diff --git a/paddlespeech/t2s/exps/synthesize_streaming.py b/paddlespeech/t2s/exps/synthesize_streaming.py index f38b2d3522f0c047ef6b3351f2db71130d2cebc4..7b9906c1076edda751c4a10772df0f82e8f1f39b 100644 --- a/paddlespeech/t2s/exps/synthesize_streaming.py +++ b/paddlespeech/t2s/exps/synthesize_streaming.py @@ -90,6 +90,7 @@ def evaluate(args): output_dir = Path(args.output_dir) output_dir.mkdir(parents=True, exist_ok=True) merge_sentences = True + get_tone_ids = False N = 0 T = 0 @@ -98,8 +99,6 @@ def evaluate(args): for utt_id, sentence in sentences: with timer() as t: - get_tone_ids = False - if args.lang == 'zh': input_ids = frontend.get_input_ids( sentence, diff --git a/paddlespeech/t2s/exps/tacotron2/preprocess.py b/paddlespeech/t2s/exps/tacotron2/preprocess.py index 7f41089ebf9d71b336d082b065e8b50c541f7edd..14a0d7eae227f5650a716bc656f3d0c32ee077e3 100644 --- a/paddlespeech/t2s/exps/tacotron2/preprocess.py +++ b/paddlespeech/t2s/exps/tacotron2/preprocess.py @@ -82,6 +82,9 @@ def process_sentence(config: Dict[str, Any], logmel = mel_extractor.get_log_mel_fbank(wav) # change duration according to mel_length compare_duration_and_mel_length(sentences, utt_id, logmel) + # utt_id may be popped in compare_duration_and_mel_length + if utt_id not in sentences: + return None phones = sentences[utt_id][0] durations = sentences[utt_id][1] num_frames = logmel.shape[0] diff --git a/paddlespeech/t2s/modules/positional_encoding.py b/paddlespeech/t2s/modules/positional_encoding.py index 7c368c3aa8a7520557b18d3bf0c09febd9151ddf..715c576f52ba586992b77e66b398f1a56e8a0fc7 100644 --- a/paddlespeech/t2s/modules/positional_encoding.py +++ b/paddlespeech/t2s/modules/positional_encoding.py @@ -31,8 +31,9 @@ def sinusoid_position_encoding(num_positions: int, channel = paddle.arange(0, feature_size, 2, dtype=dtype) index = paddle.arange(start_pos, start_pos + num_positions, 1, dtype=dtype) - p = (paddle.unsqueeze(index, -1) * - omega) / (10000.0**(channel / float(feature_size))) + denominator = channel / float(feature_size) + denominator = paddle.to_tensor([10000.0], dtype='float32')**denominator + p = (paddle.unsqueeze(index, -1) * omega) / denominator encodings = paddle.zeros([num_positions, feature_size], dtype=dtype) encodings[:, 0::2] = paddle.sin(p) encodings[:, 1::2] = paddle.cos(p) diff --git 
a/paddlespeech/vector/models/ecapa_tdnn.py b/paddlespeech/vector/models/ecapa_tdnn.py index 0e7287cd3614d8964941f6d14179e0ce7f3c4d71..895ff13f4509c7070d2473aebf8ce693a50dbcee 100644 --- a/paddlespeech/vector/models/ecapa_tdnn.py +++ b/paddlespeech/vector/models/ecapa_tdnn.py @@ -79,6 +79,20 @@ class Conv1d(nn.Layer): bias_attr=bias, ) def forward(self, x): + """Do conv1d forward + + Args: + x (paddle.Tensor): [N, C, L] input data, + N is the batch size, + C is the data dimension, + L is the time length + + Raises: + ValueError: only the "same" padding type is supported + + Returns: + paddle.Tensor: the conv1d output + """ if self.padding == "same": x = self._manage_padding(x, self.kernel_size, self.dilation, self.stride) @@ -88,6 +102,20 @@ class Conv1d(nn.Layer): return self.conv(x) def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int): + """Pad the input data + + Args: + x (paddle.Tensor): [N, C, L] input data + N is the batch size, + C is the data dimension, + L is the time length + kernel_size (int): 1-d convolution kernel size + dilation (int): 1-d convolution dilation + stride (int): 1-d convolution stride + + Returns: + paddle.Tensor: the padded input data + """ L_in = x.shape[-1] # Detecting input shape padding = self._get_padding_elem(L_in, stride, kernel_size, dilation) # Time padding @@ -101,6 +129,17 @@ class Conv1d(nn.Layer): stride: int, kernel_size: int, dilation: int): + """Calculate the padding value in same mode + + Args: + L_in (int): the time length of the input data + stride (int): 1-d convolution stride + kernel_size (int): 1-d convolution kernel size + dilation (int): 1-d convolution dilation + + Returns: + int: the padding value in same mode + """ if stride > 1: n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1) L_out = stride * (n_steps - 1) + kernel_size * dilation @@ -245,6 +284,13 @@ class SEBlock(nn.Layer): class AttentiveStatisticsPooling(nn.Layer): def __init__(self, channels, attention_channels=128, global_context=True): + """Compute the attention-weighted statistics for speaker verification. + Details are described in section 3.1 of https://arxiv.org/pdf/1709.01507.pdf + Args: + channels (int): input data channel or data dimension + attention_channels (int, optional): attention dimension. Defaults to 128. + global_context (bool, optional): Whether to use the global context information. Defaults to True. + """ super().__init__() self.eps = 1e-12
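For orientation on the pooling layer documented above: attentive statistics pooling turns a variable-length feature sequence into a fixed-size utterance vector by concatenating an attention-weighted mean and standard deviation over the time axis. The following is a minimal NumPy sketch of that statistics computation only (an illustration of the idea, not the PaddleSpeech implementation; names such as `attentive_stats` are hypothetical):

```python
import numpy as np


def attentive_stats(x: np.ndarray, attn_logits: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """x: [C, L] frame-level features; attn_logits: [L] unnormalized attention scores."""
    # softmax over the time axis gives one weight per frame
    w = np.exp(attn_logits - attn_logits.max())
    w = w / w.sum()
    # attention-weighted mean and standard deviation over time
    mean = (x * w).sum(axis=1)
    var = (x**2 * w).sum(axis=1) - mean**2
    std = np.sqrt(np.clip(var, eps, None))
    # pooled utterance-level statistics: [2 * C]
    return np.concatenate([mean, std])


# 80-dim features over 100 frames -> a 160-dim utterance vector
feats = np.random.randn(80, 100).astype("float32")
logits = np.random.randn(100).astype("float32")
print(attentive_stats(feats, logits).shape)  # (160,)
```

In the real layer the attention scores come from a small network over the frames (optionally concatenated with global statistics, as the `global_context` flag suggests), but the pooled output has the same mean-plus-std form.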