diff --git a/README.md b/README.md index 3fdb2945f55eb44db902a2b97c1b7f4382892b59..e93aa1d9ccbe37410b826bd78c586deff36874d8 100644 --- a/README.md +++ b/README.md @@ -25,14 +25,16 @@ | Documents | Models List | AIStudio Courses - | Paper + | NAACL2022 Best Demo Award Paper | Gitee ------------------------------------------------------------------------------------ -**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with the state-of-art and influential models. +**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with the state-of-art and influential models. + +**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), please check out our paper on [Arxiv](https://arxiv.org/abs/2205.12007). ##### Speech Recognition @@ -177,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision ## Installation -We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*. +We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*. Up to now, **Linux** supports CLI for the all our tasks, **Mac OSX** and **Windows** only supports PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md). @@ -706,6 +708,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P - Many thanks to [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) for developing a dubbing tool with GUI based on PaddleSpeech TTS model. - Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for developing a GUI tool based on PaddleSpeech TTS and code for making datasets from videos based on PaddleSpeech ASR. - Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for developing a rasa chatbot,which is able to speak and listen thanks to PaddleSpeech. +- Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR. Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information. diff --git a/README_cn.md b/README_cn.md index 91a01d71047c1aa38cd8c9d059b39e9ca5245d7a..8f149d63ecf7bbf9774c308d50055d6f484218f0 100644 --- a/README_cn.md +++ b/README_cn.md @@ -26,7 +26,7 @@ | 教程文档 | 模型列表 | AIStudio 课程 - | 论文 + | NAACL2022 论文 | Gitee @@ -35,6 +35,9 @@ ------------------------------------------------------------------------------------ **PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下: + +**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), 请访问 [Arxiv](https://arxiv.org/abs/2205.12007) 论文。 + ##### 语音识别
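The quick-start flow that both READMEs point to can also be driven from Python. Below is a minimal sketch based on the CLI executor classes documented in the project README; `zh.wav` is a placeholder audio file you supply yourself.

```python
# Minimal sketch of the documented Python API for ASR and TTS.
# "zh.wav" is a placeholder; any 16 kHz mono Mandarin recording works for the ASR call.
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.tts.infer import TTSExecutor

asr = ASRExecutor()
print(asr(audio_file="zh.wav"))                    # speech-to-text

tts = TTSExecutor()
tts(text="今天天气十分不错。", output="output.wav")  # text-to-speech
```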
@@ -702,7 +705,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - 非常感谢 [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) 基于 PaddleSpeech 的 TTS GUI 界面和基于 ASR 制作数据集的相关代码。 - 非常感谢 [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) 基于 PaddleSpeech 的 ASR 与 TTS 设计的可听、说对话机器人。 - +- 非常感谢 [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) 对 PaddleSpeech 的 ASR 进行 C++ 推理实现。 此外,PaddleSpeech 依赖于许多开源存储库。有关更多信息,请参阅 [references](./docs/source/reference.md)。 diff --git a/demos/README.md b/demos/README.md index 2a306df6b1e1b3648b7306506adc01d5e0ffcdf2..72b70b237944516ad53048212ecdb3504342e359 100644 --- a/demos/README.md +++ b/demos/README.md @@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios. * speech recognition - recognize text of an audio file * speech server - Server for Speech Task, e.g. ASR,TTS,CLS * streaming asr server - receive audio stream from websocket, and recognize to transcript. +* streaming tts server - receive text from http or websocket, and streaming audio data stream. * speech translation - end to end speech translation * story talker - book reader based on OCR and TTS * style_fs2 - multi style control for FastSpeech2 model diff --git a/demos/README_cn.md b/demos/README_cn.md index 471342127f4e6e49522714d5926f5c185fbdb92b..04fc1fa7d4a4fa5c4df6229dc92b4df9ad6ceda5 100644 --- a/demos/README_cn.md +++ b/demos/README_cn.md @@ -10,8 +10,9 @@ * 元宇宙 - 基于语音合成的 2D 增强现实。 * 标点恢复 - 通常作为语音识别的文本后处理任务,为一段无标点的纯文本添加相应的标点符号。 * 语音识别 - 识别一段音频中包含的语音文字。 -* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等 -* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字 +* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等。 +* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字。 +* 流式语音合成服务 - 根据待合成文本流式生成合成音频数据流。 * 语音翻译 - 实时识别音频中的语言,并同时翻译成目标语言。 * 会说话的故事书 - 基于 OCR 和语音合成的会说话的故事书。 * 个性化语音合成 - 基于 FastSpeech2 模型的个性化语音合成。 diff --git a/demos/custom_streaming_asr/setup_docker.sh b/demos/custom_streaming_asr/setup_docker.sh old mode 100644 new mode 100755 diff --git a/demos/keyword_spotting/run.sh b/demos/keyword_spotting/run.sh old mode 100644 new mode 100755 diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh old mode 100644 new mode 100755 diff --git a/demos/speech_recognition/run.sh b/demos/speech_recognition/run.sh old mode 100644 new mode 100755 index 19ce0ebb3f0c24cf4adf35dea3bf08232ebe429d..e48ff3e9667d41ec4025e9ef30e55257a6af34a5 --- a/demos/speech_recognition/run.sh +++ b/demos/speech_recognition/run.sh @@ -1,6 +1,7 @@ #!/bin/bash -wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav # asr paddlespeech asr --input ./zh.wav @@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav # asr + punc paddlespeech asr --input ./zh.wav | paddlespeech text --task punc + + +# asr help +paddlespeech asr --help + + +# english asr +paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav + +# model stats +paddlespeech stats --task asr + + +# paddlespeech help +paddlespeech --help diff --git a/demos/speech_server/README.md b/demos/speech_server/README.md index 14a88f0780396101583244a0e7e9d39681c159fd..34f927fa49cfb3fe15ce1476392494ce61002a7f 100644 --- a/demos/speech_server/README.md +++ b/demos/speech_server/README.md @@ -11,7 +11,8 @@ This demo is an implementation of starting the voice service and accessing the s see 
[installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
 It is recommended to use **paddlepaddle 2.2.2** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+You can choose one way from easy, medium and hard to install paddlespeech.
+**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
 ### 2. Prepare config File
 The configuration file can be found in `conf/application.yaml` .
diff --git a/demos/speech_server/README_cn.md b/demos/speech_server/README_cn.md
index 29629b7e892a3355867c0c0f8651e7d897644621..bcbb2cbe2764f1fef2c6d925620ee337b7606716 100644
--- a/demos/speech_server/README_cn.md
+++ b/demos/speech_server/README_cn.md
@@ -3,7 +3,7 @@
 # 语音服务
 ## 介绍
-这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。
+这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用`paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
 ## 使用方法
@@ -11,13 +11,14 @@
 请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
 推荐使用 **paddlepaddle 2.2.2** 或以上版本。
-你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
 ### 2. 准备配置文件
 配置文件可参见 `conf/application.yaml` 。
 其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
-目前服务集成的语音任务有: asr(语音识别)、tts(语音合成)、cls(音频分类)、vector(声纹识别)以及text(文本处理)。
+目前服务集成的语音任务有: asr(语音识别)、tts(语音合成)、cls(音频分类)、vector(声纹识别)以及 text(文本处理)。
 目前引擎类型支持两种形式:python 及 inference (Paddle Inference)
 **注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。
diff --git a/demos/speech_server/asr_client.sh b/demos/speech_server/asr_client.sh
old mode 100644
new mode 100755
diff --git a/demos/speech_server/cls_client.sh b/demos/speech_server/cls_client.sh
old mode 100644
new mode 100755
diff --git a/demos/speech_server/server.sh b/demos/speech_server/server.sh
old mode 100644
new mode 100755
index e5961286ba459ba6bdf8ee46de4e512cc7e25640..fd719ffc101ae1328330d6ae88fc8b8fefd210f1
--- a/demos/speech_server/server.sh
+++ b/demos/speech_server/server.sh
@@ -1,3 +1,3 @@
 #!/bin/bash
-paddlespeech_server start --config_file ./conf/application.yaml
+paddlespeech_server start --config_file ./conf/application.yaml &> server.log &
diff --git a/demos/speech_server/sid_client.sh b/demos/speech_server/sid_client.sh
new file mode 100755
index 0000000000000000000000000000000000000000..99bab21ae25d349b41e36f7b762dcdaabac01379
--- /dev/null
+++ b/demos/speech_server/sid_client.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
+
+# sid extract
+paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
+
+# sid score
+paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav
diff --git a/demos/speech_server/text_client.sh b/demos/speech_server/text_client.sh
new file mode 100755
index 0000000000000000000000000000000000000000..098f159fb4bfbd16d48c290f5050c949fc1f4369
--- /dev/null
+++ b/demos/speech_server/text_client.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+
+paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭
diff --git a/demos/speech_server/tts_client.sh b/demos/speech_server/tts_client.sh
old mode 100644
new mode 100755
diff --git
a/demos/speech_web/web_client/package-lock.json b/demos/speech_web/web_client/package-lock.json index f1c779782307c44e2b33807545e82d0340ec222b..509be385ce31f1070ce5131dccf7af28f43c7067 100644 --- a/demos/speech_web/web_client/package-lock.json +++ b/demos/speech_web/web_client/package-lock.json @@ -747,9 +747,9 @@ } }, "node_modules/moment": { - "version": "2.29.3", - "resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz", - "integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==", + "version": "2.29.4", + "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz", + "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==", "engines": { "node": "*" } @@ -1636,9 +1636,9 @@ "optional": true }, "moment": { - "version": "2.29.3", - "resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz", - "integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==" + "version": "2.29.4", + "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz", + "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==" }, "nanoid": { "version": "3.3.2", diff --git a/demos/speech_web/web_client/yarn.lock b/demos/speech_web/web_client/yarn.lock index 4504eab36a3b74e4a482e7f96c7b93c3457385a0..6777cf4ce7623a712ace6d1df68044b5ca09635c 100644 --- a/demos/speech_web/web_client/yarn.lock +++ b/demos/speech_web/web_client/yarn.lock @@ -587,9 +587,9 @@ mime@^1.4.1: integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg== moment@^2.27.0: - version "2.29.3" - resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz" - integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw== + version "2.29.4" + resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108" + integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w== ms@^2.1.1: version "2.1.3" diff --git a/demos/streaming_asr_server/README.md b/demos/streaming_asr_server/README.md index a770f58c3c394ca897920c82aa5bab735f61d75d..773d4957c0f8504a443b26f1ad2e21622e2dd1f7 100644 --- a/demos/streaming_asr_server/README.md +++ b/demos/streaming_asr_server/README.md @@ -12,7 +12,8 @@ Streaming ASR server only support `websocket` protocol, and doesn't support `htt see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). It is recommended to use **paddlepaddle 2.2.1** or above. -You can choose one way from meduim and hard to install paddlespeech. +You can choose one way from easy, meduim and hard to install paddlespeech. +**If you install in simple mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.** ### 2. Prepare config File The configuration file can be found in `conf/ws_application.yaml` 和 `conf/ws_conformer_wenetspeech_application.yaml`. 
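Either yaml can be copied and adjusted before starting the server. As a small illustration (not part of this patch), the top-level `host`/`port` keys, the same keys visible in the streaming TTS yaml added later in this diff, can be rewritten programmatically, which helps when the default address is not reachable from the client:

```python
# Hypothetical helper: copy a server yaml and point it at a reachable address.
# Assumes the top-level host/port keys shown in the server configs in this repo.
import yaml

with open("conf/ws_conformer_wenetspeech_application.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["host"] = "127.0.0.1"   # use an IP the client can actually reach
cfg["port"] = 8090

with open("conf/my_application.yaml", "w") as f:
    yaml.safe_dump(cfg, f, allow_unicode=True)
```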
diff --git a/demos/streaming_asr_server/README_cn.md b/demos/streaming_asr_server/README_cn.md
index c771869e95caa9a027dba93ca1b810dfa5d5947b..f9fd536ea706823e3269d7541acd1f588d4266b8 100644
--- a/demos/streaming_asr_server/README_cn.md
+++ b/demos/streaming_asr_server/README_cn.md
@@ -12,8 +12,8 @@
 安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。
 推荐使用 **paddlepaddle 2.2.1** 或以上版本。
-你可以从medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
-
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
 ### 2. 准备配置文件
diff --git a/demos/streaming_asr_server/punc_server.py b/demos/streaming_asr_server/local/punc_server.py
similarity index 100%
rename from demos/streaming_asr_server/punc_server.py
rename to demos/streaming_asr_server/local/punc_server.py
diff --git a/demos/streaming_asr_server/streaming_asr_server.py b/demos/streaming_asr_server/local/streaming_asr_server.py
similarity index 100%
rename from demos/streaming_asr_server/streaming_asr_server.py
rename to demos/streaming_asr_server/local/streaming_asr_server.py
diff --git a/demos/streaming_asr_server/run.sh b/demos/streaming_asr_server/run.sh
old mode 100644
new mode 100755
diff --git a/demos/streaming_asr_server/server.sh b/demos/streaming_asr_server/server.sh
index f532546e7c2280cdf9dab0f0e09765031673edf0..961cb046ab37c244c183be5a2647ef1812103e11 100755
--- a/demos/streaming_asr_server/server.sh
+++ b/demos/streaming_asr_server/server.sh
@@ -1,9 +1,8 @@
-export CUDA_VISIBLE_DEVICE=0,1,2,3
-
 export CUDA_VISIBLE_DEVICE=0,1,2,3
+#export CUDA_VISIBLE_DEVICE=0,1,2,3
-# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
+# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
 paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
-# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
+# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
 paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &
diff --git a/demos/streaming_asr_server/test.sh b/demos/streaming_asr_server/test.sh
index 67a5ec4c5023fc49d4585b608cabb87d88255811..386c7f89436181be9a168a97426cdf8836309418 100755
--- a/demos/streaming_asr_server/test.sh
+++ b/demos/streaming_asr_server/test.sh
@@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
 # read the wav and call streaming and punc service
 # If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
+paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
diff --git a/demos/streaming_tts_server/README.md b/demos/streaming_tts_server/README.md
index cbea6bf774140f16fa30b09b0c30cd798782c5fa..645bde86a1df8d9398dbd33312166fdf5a61fa97 100644
--- a/demos/streaming_tts_server/README.md
+++ b/demos/streaming_tts_server/README.md
@@ -11,7 +11,8 @@ This demo is an implementation of starting the streaming speech synthesis servic
 see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above. -You can choose one way from meduim and hard to install paddlespeech. +You can choose one way from easy, meduim and hard to install paddlespeech. +**If you install in simple mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.** ### 2. Prepare config File diff --git a/demos/streaming_tts_server/README_cn.md b/demos/streaming_tts_server/README_cn.md index 3cd2817096e41d0a5d8b808df34ceb50767206ae..c88e3196a8eab86598fa274771f67afaf4cd3931 100644 --- a/demos/streaming_tts_server/README_cn.md +++ b/demos/streaming_tts_server/README_cn.md @@ -11,8 +11,8 @@ 请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). 推荐使用 **paddlepaddle 2.2.2** 或以上版本。 -你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。 - +你可以从简单,中等,困难几种方式中选择一种方式安装 PaddleSpeech。 +**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。** ### 2. 准备配置文件 配置文件可参见 `conf/tts_online_application.yaml` 。 diff --git a/demos/streaming_tts_server/test_client.sh b/demos/streaming_tts_server/client.sh old mode 100644 new mode 100755 similarity index 61% rename from demos/streaming_tts_server/test_client.sh rename to demos/streaming_tts_server/client.sh index bd88f20b1bce760437c5c2bf7ffa85624b5490e2..e93da58a8c8206fd2524f50023807d8341eb7125 --- a/demos/streaming_tts_server/test_client.sh +++ b/demos/streaming_tts_server/client.sh @@ -2,8 +2,8 @@ # http client test # If `127.0.0.1` is not accessible, you need to use the actual service IP address. -paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav +paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav # websocket client test # If `127.0.0.1` is not accessible, you need to use the actual service IP address. -# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav +paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav diff --git a/demos/streaming_tts_server/conf/tts_online_ws_application.yaml b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml new file mode 100644 index 0000000000000000000000000000000000000000..146f06f157354849e372cb302793b56b9ffa8acc --- /dev/null +++ b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml @@ -0,0 +1,103 @@ +# This is the parameter configuration file for streaming tts server. + +################################################################################# +# SERVER SETTING # +################################################################################# +host: 0.0.0.0 +port: 8192 + +# The task format in the engin_list is: _ +# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online. 
+# protocol choices = ['websocket', 'http'] +protocol: 'websocket' +engine_list: ['tts_online-onnx'] + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### TTS ######################################### +################### speech task: tts; engine_type: online ####################### +tts_online: + # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] + # fastspeech2_cnndecoder_csmsc support streaming am infer. + am: 'fastspeech2_csmsc' + am_config: + am_ckpt: + am_stat: + phones_dict: + tones_dict: + speaker_dict: + spk_id: 0 + + # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc'] + # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference + voc: 'mb_melgan_csmsc' + voc_config: + voc_ckpt: + voc_stat: + + # others + lang: 'zh' + device: 'cpu' # set 'gpu:id' or 'cpu' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio + am_block: 72 + am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal + voc_block: 36 + voc_pad: 14 + + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### TTS ######################################### +################### speech task: tts; engine_type: online-onnx ####################### +tts_online-onnx: + # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer. 
+ am: 'fastspeech2_cnndecoder_csmsc_onnx' + # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model]; + # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]; + am_ckpt: # list + am_stat: + phones_dict: + tones_dict: + speaker_dict: + spk_id: 0 + am_sample_rate: 24000 + am_sess_conf: + device: "cpu" # set 'gpu:id' or 'cpu' + use_trt: False + cpu_threads: 4 + + # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx'] + # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference + voc: 'hifigan_csmsc_onnx' + voc_ckpt: + voc_sample_rate: 24000 + voc_sess_conf: + device: "cpu" # set 'gpu:id' or 'cpu' + use_trt: False + cpu_threads: 4 + + # others + lang: 'zh' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio + am_block: 72 + am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc_onnx, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal + voc_block: 36 + voc_pad: 14 + # voc_upsample should be same as n_shift on voc config. + voc_upsample: 300 + diff --git a/demos/streaming_tts_server/server.sh b/demos/streaming_tts_server/server.sh new file mode 100755 index 0000000000000000000000000000000000000000..d34ddba02c7be50929db82af409980e8ecde7643 --- /dev/null +++ b/demos/streaming_tts_server/server.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +# http server +paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log & + + +# websocket server +paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log & + + diff --git a/demos/streaming_tts_server/start_server.sh b/demos/streaming_tts_server/start_server.sh deleted file mode 100644 index 9c71f2fe22d01656553a56ab2d9b628bfdb54a2c..0000000000000000000000000000000000000000 --- a/demos/streaming_tts_server/start_server.sh +++ /dev/null @@ -1,3 +0,0 @@ -#!/bin/bash -# start server -paddlespeech_server start --config_file ./conf/tts_online_application.yaml \ No newline at end of file diff --git a/demos/text_to_speech/run.sh b/demos/text_to_speech/run.sh index b1340241bf833129de9ae5581ada4a542253f96c..2b588be554a503097f061acae60cdecab3373b26 100755 --- a/demos/text_to_speech/run.sh +++ b/demos/text_to_speech/run.sh @@ -4,4 +4,10 @@ paddlespeech tts --input 今天的天气不错啊 # Batch process -echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts \ No newline at end of file +echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts + +# Text Frontend +paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃. 
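The same synthesis calls are available from Python. A rough equivalent of the CLI commands above, assuming the `TTSExecutor` interface documented in the project README (output paths are placeholders):

```python
# Rough Python equivalent of the CLI calls above; output file names are placeholders.
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(text="今天的天气不错啊", output="output.wav")

# The text frontend normalizes dates, numbers and units before synthesis.
tts(text="今天是2022/10/29,最低温度是-3℃.", output="output_frontend.wav")
```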
+ + + diff --git a/docker/ubuntu16-gpu/Dockerfile b/docker/ubuntu16-gpu/Dockerfile new file mode 100644 index 0000000000000000000000000000000000000000..f275471ee48a5a392db8b19fc7d0d224eaafc1c0 --- /dev/null +++ b/docker/ubuntu16-gpu/Dockerfile @@ -0,0 +1,77 @@ +FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu16.04 + +RUN echo "deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial universe \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates universe \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial multiverse \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates multiverse \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security universe \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security multiverse" > /etc/apt/sources.list + +RUN apt-get update && apt-get install -y inetutils-ping wget vim curl cmake git sox libsndfile1 libpng12-dev \ + libpng-dev swig libzip-dev openssl bc libflac* libgdk-pixbuf2.0-dev libpango1.0-dev libcairo2-dev \ + libgtk2.0-dev pkg-config zip unzip zlib1g-dev libreadline-dev libbz2-dev liblapack-dev libjpeg-turbo8-dev \ + sudo lrzsz libsqlite3-dev libx11-dev libsm6 apt-utils libopencv-dev libavcodec-dev libavformat-dev \ + libswscale-dev locales liblzma-dev python-lzma m4 libxext-dev strace libibverbs-dev libpcre3 libpcre3-dev \ + build-essential libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev xz-utils \ + libfreetype6-dev libxslt1-dev libxml2-dev libgeos-3.5.0 libgeos-dev && apt-get install -y --allow-downgrades \ + --allow-change-held-packages libnccl2 libnccl-dev && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata \ + && /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && dpkg-reconfigure -f noninteractive tzdata && \ + cd /usr/lib/x86_64-linux-gnu && ln -s libcudnn.so.8 libcudnn.so && \ + cd /usr/local/cuda-11.2/targets/x86_64-linux/lib && ln -s libcublas.so.11.4.1.1043 libcublas.so && \ + ln -s libcusolver.so.11.1.0.152 libcusolver.so && ln -s libcusparse.so.11 libcusparse.so && \ + ln -s libcufft.so.10.4.1.152 libcufft.so + +RUN echo "set meta-flag on" >> /etc/inputrc && echo "set convert-meta off" >> /etc/inputrc && \ + locale-gen en_US.UTF-8 && /sbin/ldconfig -v && groupadd -g 10001 paddle && \ + useradd -m -s /bin/bash -N -u 10001 paddle -g paddle && chmod g+w /etc/passwd && \ + echo "paddle ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers + +ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LANGUAGE=en_US.UTF-8 TZ=Asia/Shanghai + +# official download site: https://www.python.org/ftp/python/3.7.13/Python-3.7.13.tgz +RUN wget https://cdn.npmmirror.com/binaries/python/3.7.13/Python-3.7.13.tgz && tar xvf Python-3.7.13.tgz && \ + cd Python-3.7.13 && ./configure --prefix=/home/paddle/python3.7 && make -j8 && make install && \ + rm -rf ../Python-3.7.13 ../Python-3.7.13.tgz && chown -R paddle:paddle /home/paddle/python3.7 + +RUN cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/gmp/gmp-6.1.0.tar.bz2 && tar xvf gmp-6.1.0.tar.bz2 && \ + cd gmp-6.1.0 && ./configure 
--prefix=/usr/local && make -j8 && make install && \ + rm -rf ../gmp-6.1.0.tar.bz2 ../gmp-6.1.0 && cd /tmp && \ + wget https://www.mpfr.org/mpfr-3.1.4/mpfr-3.1.4.tar.bz2 && tar xvf mpfr-3.1.4.tar.bz2 && cd mpfr-3.1.4 && \ + ./configure --prefix=/usr/local && make -j8 && make install && rm -rf ../mpfr-3.1.4.tar.bz2 ../mpfr-3.1.4 && \ + cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/mpc/mpc-1.0.3.tar.gz && tar xvf mpc-1.0.3.tar.gz && \ + cd mpc-1.0.3 && ./configure --prefix=/usr/local && make -j8 && make install && \ + rm -rf ../mpc-1.0.3.tar.gz ../mpc-1.0.3 && cd /tmp && \ + wget http://www.mirrorservice.org/sites/sourceware.org/pub/gcc/infrastructure/isl-0.18.tar.bz2 && \ + tar xvf isl-0.18.tar.bz2 && cd isl-0.18 && ./configure --prefix=/usr/local && make -j8 && make install \ + && rm -rf ../isl-0.18.tar.bz2 ../isl-0.18 && cd /tmp && \ + wget http://mirrors.ustc.edu.cn/gnu/gcc/gcc-8.2.0/gcc-8.2.0.tar.gz --no-check-certificate && \ + tar xvf gcc-8.2.0.tar.gz && cd gcc-8.2.0 && unset LIBRARY_PATH && ./configure --prefix=/home/paddle/gcc82 \ + --enable-threads=posix --disable-checking --disable-multilib --enable-languages=c,c++ --with-gmp=/usr/local \ + --with-mpfr=/usr/local --with-mpc=/usr/local --with-isl=/usr/local && make -j8 && make install && \ + rm -rf ../gcc-8.2.0.tar.gz ../gcc-8.2.0 && chown -R paddle:paddle /home/paddle/gcc82 + +WORKDIR /home/paddle +ENV PATH=/home/paddle/python3.7/bin:/home/paddle/gcc82/bin:${PATH} \ + LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.2/targets/x86_64-linux/lib:${LD_LIBRARY_PATH} + +RUN mkdir -p ~/.pip && echo "[global]" > ~/.pip/pip.conf && \ + echo "index-url=https://mirror.baidu.com/pypi/simple" >> ~/.pip/pip.conf && \ + echo "trusted-host=mirror.baidu.com" >> ~/.pip/pip.conf && \ + python3 -m pip install --upgrade pip && \ + pip install paddlepaddle-gpu==2.3.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html && \ + rm -rf ~/.cache/pip + +RUN git clone https://github.com/PaddlePaddle/PaddleSpeech.git && cd PaddleSpeech && \ + pip3 install pytest-runner paddleaudio -i https://pypi.tuna.tsinghua.edu.cn/simple && \ + pip3 install -e .[develop] -i https://pypi.tuna.tsinghua.edu.cn/simple && \ + pip3 install importlib-metadata==4.2.0 urllib3==1.25.10 -i https://pypi.tuna.tsinghua.edu.cn/simple && \ + rm -rf ~/.cache/pip && \ + sudo cp -f /home/paddle/gcc82/lib64/libstdc++.so.6.0.25 /usr/lib/x86_64-linux-gnu/libstdc++.so.6 && \ + chown -R paddle:paddle /home/paddle/PaddleSpeech + +USER paddle +CMD ['bash'] diff --git a/docs/source/install.md b/docs/source/install.md index e8bd5adf91df09d68385ad50cbaa8aa90f4ced63..6a9ff3bc83c4a85061eb60ad6379dbcc8303faaf 100644 --- a/docs/source/install.md +++ b/docs/source/install.md @@ -10,7 +10,7 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t ## Prerequisites - Python >= 3.7 -- PaddlePaddle latest version (please refer to the [Installation Guide] (https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)) +- PaddlePaddle latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)) - C++ compilation environment - Hip: For Linux and Mac, do not use command `sh` instead of command `bash` in installation document. - Hip: We recommand you to install `paddlepaddle` from https://mirror.baidu.com/pypi/simple and install `paddlespeech` from https://pypi.tuna.tsinghua.edu.cn/simple. 
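After installing `paddlepaddle` (the updated instructions below pin 2.3.1), it is worth confirming that the wheel matches your CUDA setup before installing `paddlespeech`. A convenience sketch using Paddle's built-in self-check:

```python
# Post-install sanity check for paddlepaddle (convenience sketch, not part of this patch).
import paddle

print(paddle.__version__)   # the updated docs assume >= 2.3.1
paddle.utils.run_check()    # verifies the install and reports usable GPU devices, if any
```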
@@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0 ``` (Hip: Do not use the last script if you want to install by **Hard** way): ### Install PaddlePaddle -You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.2.0: +You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### Install PaddleSpeech You can install `paddlespeech` by the following command,then you can use the `ready-made` examples in `paddlespeech` : @@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you ```bash pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple ``` -Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.2.0: +Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### Install PaddleSpeech in Developing Mode ```bash diff --git a/docs/source/install_cn.md b/docs/source/install_cn.md index 75f4174e06ee285f5e6ef85037402ccc5d76512a..9f49ebad637e9d59957c7ad25e028d7977b04616 100644 --- a/docs/source/install_cn.md +++ b/docs/source/install_cn.md @@ -111,9 +111,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0 ``` (提示: 如果你想使用**困难**方式完成安装,请不要使用最后一条命令) ### 安装 PaddlePaddle -你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0: +你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### 安装 PaddleSpeech 最后安装 `paddlespeech`,这样你就可以使用 `paddlespeech` 中已有的 examples: @@ -168,9 +168,9 @@ conda activate tools/venv conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc ``` ### 安装 PaddlePaddle -请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0: +请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### 用开发者模式安装 PaddleSpeech 部分用户系统由于默认源的问题,安装中会出现 kaldiio 安转出错的问题,建议首先安装 pytest-runner: diff --git a/docs/topic/package_release/python_package_release.md b/docs/topic/package_release/python_package_release.md index 3e3f9dbf66d6ab297efd13e58ad25ec8e307e4d5..cb1029e7b66711fbd9657965aaca218db77a83a0 100644 --- a/docs/topic/package_release/python_package_release.md +++ b/docs/topic/package_release/python_package_release.md @@ -150,7 +150,7 @@ manylinux1 支持 Centos5以上, manylinux2010 支持 Centos 6 以上,manyli ### 拉取 manylinux2010 ```bash -docker pull quay.io/pypa/manylinux1_x86_64 +docker pull quay.io/pypa/manylinux2010_x86_64 ``` ### 使用 manylinux2010 diff --git a/examples/aishell/asr1/README.md b/examples/aishell/asr1/README.md index 
25b28ede8c9ed7a891527bb934b072c1c3692476..a7390fd68949f2e74de1b82658d9e85d79513b50 100644
--- a/examples/aishell/asr1/README.md
+++ b/examples/aishell/asr1/README.md
@@ -1,5 +1,5 @@
 # Transformer/Conformer ASR with Aishell
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
 ## Overview
 All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
 | Stage | Function |
diff --git a/examples/callcenter/README.md b/examples/callcenter/README.md
index 1c715cb694dfde06bd9e7d53fafb401e49b8025b..6d5211461b9820d275a247043665ba53814b46da 100644
--- a/examples/callcenter/README.md
+++ b/examples/callcenter/README.md
@@ -1,20 +1,3 @@
 # Callcenter 8k sample rate
-Data distribution:
-
-```
-676048 utts
-491.4004722221223 h
-4357792.0 text
-2.4633630739178654 text/sec
-2.6167397877068495 sec/utt
-```
-
-train/dev/test partition:
-
-```
- 33802 manifest.dev
- 67606 manifest.test
- 574640 manifest.train
- 676048 total
-```
+This recipe only provides the model and data configs for 8k ASR; users need to prepare the data and generate the manifest metafiles themselves. See the Aishell or Librispeech recipes for reference.
diff --git a/examples/csmsc/vits/README.md b/examples/csmsc/vits/README.md
index 5ca57e3a3603eb53fe4bf7c16fc1ba51bbc14147..8f223e07b017e9287aa7e4be823ae9e20c50cd03 100644
--- a/examples/csmsc/vits/README.md
+++ b/examples/csmsc/vits/README.md
@@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
 vits_csmsc_ckpt_1.1.0
 ├── default.yaml # default config used to train vitx
 ├── phone_id_map.txt # phone vocabulary file when training vits
-└── snapshot_iter_350000.pdz # model parameters and optimizer states
+└── snapshot_iter_333000.pdz # model parameters and optimizer states
 ```
 ps: This ckpt is not good enough, a better result is training
@@ -169,7 +169,7 @@ FLAGS_allocator_strategy=naive_best_fit \
 FLAGS_fraction_of_gpu_memory_to_use=0.01 \
 python3 ${BIN_DIR}/synthesize_e2e.py \
 --config=vits_csmsc_ckpt_1.1.0/default.yaml \
- --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
+ --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
 --phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
 --output_dir=exp/default/test_e2e \
 --text=${BIN_DIR}/../sentences.txt \
diff --git a/examples/csmsc/vits/conf/default.yaml b/examples/csmsc/vits/conf/default.yaml
index 32f995cc9489359bc91bb951442c5fde78286724..a2aef998d2843bc4330339120304bf8741601bf3 100644
--- a/examples/csmsc/vits/conf/default.yaml
+++ b/examples/csmsc/vits/conf/default.yaml
@@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
 # OTHER TRAINING SETTING #
 ##########################################################
 num_snapshots: 10 # max number of snapshots to keep while training
-train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
+train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
 save_interval_steps: 1000 # Interval steps to save checkpoint.
 eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number diff --git a/examples/librispeech/asr1/README.md b/examples/librispeech/asr1/README.md index ae252a58b58b2bf2602b73c8477e38fa5670a831..ca0081444c7566ffdb47711fa4d4539e70ab9017 100644 --- a/examples/librispeech/asr1/README.md +++ b/examples/librispeech/asr1/README.md @@ -1,5 +1,5 @@ # Transformer/Conformer ASR with Librispeech -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) +This example contains code used to train [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) ## Overview All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. | Stage | Function | diff --git a/examples/librispeech/asr2/README.md b/examples/librispeech/asr2/README.md index 5bc7185a9b0cf4a79dad60ecaeefb1e65b3f8bb1..26978520da25c4542b4d93f53c9945f790bff06f 100644 --- a/examples/librispeech/asr2/README.md +++ b/examples/librispeech/asr2/README.md @@ -1,6 +1,6 @@ # Transformer/Conformer ASR with Librispeech ASR2 -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi. +This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi. To use this example, you need to install Kaldi first. diff --git a/examples/tiny/asr1/README.md b/examples/tiny/asr1/README.md index 6a4999aa649f336a51a927f87965458cd7bdaa96..cfa266704519194fe9062752f4058dcc705988fb 100644 --- a/examples/tiny/asr1/README.md +++ b/examples/tiny/asr1/README.md @@ -1,5 +1,5 @@ # Transformer/Conformer ASR with Tiny -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33)) +This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33)) ## Overview All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. 
| Stage | Function | diff --git a/examples/vctk/tts3/local/synthesize.sh b/examples/vctk/tts3/local/synthesize.sh index 9e03f9b8a4701c0232ad7a2f1262533dd0a61435..87145959ff5b9affa864804e7804496d4ed5948f 100755 --- a/examples/vctk/tts3/local/synthesize.sh +++ b/examples/vctk/tts3/local/synthesize.sh @@ -31,7 +31,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --am=fastspeech2_aishell3 \ + --am=fastspeech2_vctk \ --am_config=${config_path} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_stat=dump/train/speech_stats.npy \ diff --git a/paddlespeech/__init__.py b/paddlespeech/__init__.py index b781c4a8e5cc99590e179faf1c4c3989349d4216..4b1c0ef3dea750c0503909dd168624ca845baafb 100644 --- a/paddlespeech/__init__.py +++ b/paddlespeech/__init__.py @@ -14,3 +14,5 @@ import _locale _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8']) + + diff --git a/paddlespeech/audio/__init__.py b/paddlespeech/audio/__init__.py index 6184c1dd4a7cdf3b47e464209873f56f384904e8..83be8e32e1d96bb5fe27da21232db38a5a6eb062 100644 --- a/paddlespeech/audio/__init__.py +++ b/paddlespeech/audio/__init__.py @@ -14,6 +14,9 @@ from . import compliance from . import datasets from . import features +from . import text +from . import transform +from . import streamdata from . import functional from . import io from . import metric diff --git a/paddlespeech/audio/text/__init__.py b/paddlespeech/audio/text/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..185a92b8d94d3426d616c0624f0f2ee04339349e --- /dev/null +++ b/paddlespeech/audio/text/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/s2t/__init__.py b/paddlespeech/s2t/__init__.py index 99b8bcbe6ee8b114abc28e91b91b109e8a6f2bf5..5fe2e16b91baf09f129884c59fc1375449751801 100644 --- a/paddlespeech/s2t/__init__.py +++ b/paddlespeech/s2t/__init__.py @@ -18,7 +18,6 @@ from typing import Union import paddle from paddle import nn -from paddle.fluid import core from paddle.nn import functional as F from paddlespeech.s2t.utils.log import Log @@ -39,46 +38,6 @@ paddle.long = 'int64' paddle.uint16 = 'uint16' paddle.cdouble = 'complex128' - -def convert_dtype_to_string(tensor_dtype): - """ - Convert the data type in numpy to the data type in Paddle - Args: - tensor_dtype(core.VarDesc.VarType): the data type in numpy. - Returns: - core.VarDesc.VarType: the data type in Paddle. 
- """ - dtype = tensor_dtype - if dtype == core.VarDesc.VarType.FP32: - return paddle.float32 - elif dtype == core.VarDesc.VarType.FP64: - return paddle.float64 - elif dtype == core.VarDesc.VarType.FP16: - return paddle.float16 - elif dtype == core.VarDesc.VarType.INT32: - return paddle.int32 - elif dtype == core.VarDesc.VarType.INT16: - return paddle.int16 - elif dtype == core.VarDesc.VarType.INT64: - return paddle.int64 - elif dtype == core.VarDesc.VarType.BOOL: - return paddle.bool - elif dtype == core.VarDesc.VarType.BF16: - # since there is still no support for bfloat16 in NumPy, - # uint16 is used for casting bfloat16 - return paddle.uint16 - elif dtype == core.VarDesc.VarType.UINT8: - return paddle.uint8 - elif dtype == core.VarDesc.VarType.INT8: - return paddle.int8 - elif dtype == core.VarDesc.VarType.COMPLEX64: - return paddle.complex64 - elif dtype == core.VarDesc.VarType.COMPLEX128: - return paddle.complex128 - else: - raise ValueError("Not supported tensor dtype %s" % dtype) - - if not hasattr(paddle, 'softmax'): logger.debug("register user softmax to paddle, remove this when fixed!") setattr(paddle, 'softmax', paddle.nn.functional.softmax) @@ -156,25 +115,6 @@ if not hasattr(paddle.Tensor, 'new_full'): paddle.static.Variable.new_full = new_full -def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor: - if convert_dtype_to_string(xs.dtype) == paddle.bool: - xs = xs.astype(paddle.int) - return xs.equal(ys) - - -if not hasattr(paddle.Tensor, 'eq'): - logger.debug( - "override eq of paddle.Tensor if exists or register, remove this when fixed!" - ) - paddle.Tensor.eq = eq - paddle.static.Variable.eq = eq - -if not hasattr(paddle, 'eq'): - logger.debug( - "override eq of paddle if exists or register, remove this when fixed!") - paddle.eq = eq - - def contiguous(xs: paddle.Tensor) -> paddle.Tensor: return xs diff --git a/paddlespeech/s2t/models/u2/u2.py b/paddlespeech/s2t/models/u2/u2.py index 3af3536000fdd8eeac9ae78d72f1c10116b23f52..c7750184866850e2ba2eb0bd27a2557a5e212b6e 100644 --- a/paddlespeech/s2t/models/u2/u2.py +++ b/paddlespeech/s2t/models/u2/u2.py @@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer): dim=1) # (B*N, i+1) # 2.6 Update end flag - end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1) + end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1) # 3. Select best of best scores = scores.view(batch_size, beam_size) diff --git a/paddlespeech/s2t/modules/align.py b/paddlespeech/s2t/modules/align.py index ad71ee02166ef850309ae7b38f9550941fcfb3de..cacda246148f3efaff6225a01ec76ace57ed9095 100644 --- a/paddlespeech/s2t/modules/align.py +++ b/paddlespeech/s2t/modules/align.py @@ -13,8 +13,7 @@ # limitations under the License. import paddle from paddle import nn - -from paddlespeech.s2t.modules.initializer import KaimingUniform +import math """ To align the initializer between paddle and torch, the API below are set defalut initializer with priority higger than global initializer. 
@@ -82,10 +81,10 @@ class Linear(nn.Linear): name=None): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Linear, self).__init__(in_features, out_features, weight_attr, bias_attr, name) @@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D): data_format='NCL'): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Conv1D, self).__init__( in_channels, out_channels, kernel_size, stride, padding, dilation, groups, padding_mode, weight_attr, bias_attr, data_format) @@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D): data_format='NCHW'): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Conv2D, self).__init__( in_channels, out_channels, kernel_size, stride, padding, dilation, groups, padding_mode, weight_attr, bias_attr, data_format) diff --git a/paddlespeech/s2t/modules/attention.py b/paddlespeech/s2t/modules/attention.py index 454f9c1477462903dfa54f26004e2f19fae4c5b7..b6d615867723301ca4f8df941ca5d11cbaa93d8b 100644 --- a/paddlespeech/s2t/modules/attention.py +++ b/paddlespeech/s2t/modules/attention.py @@ -109,7 +109,7 @@ class MultiHeadedAttention(nn.Layer): # 1. onnx(16/-1, -1/-1, 16/0) # 2. 
jit (16/-1, -1/-1, 16/0, 16/4) if paddle.shape(mask)[2] > 0: # time2 > 0 - mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2) + mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2) # for last chunk, time2 might be larger than scores.size(-1) mask = mask[:, :, :, :paddle.shape(scores)[-1]] scores = scores.masked_fill(mask, -float('inf')) @@ -321,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): scores = (matrix_ac + matrix_bd) / math.sqrt( self.d_k) # (batch, head, time1, time2) - return self.forward_attention(v, scores, mask), new_cache \ No newline at end of file + return self.forward_attention(v, scores, mask), new_cache diff --git a/paddlespeech/s2t/modules/initializer.py b/paddlespeech/s2t/modules/initializer.py index 30a04e44fb2965d03be8c6346ef16448ed257bbc..cdcf2e0523a3d72b5cba077a36879eb65acd4ce2 100644 --- a/paddlespeech/s2t/modules/initializer.py +++ b/paddlespeech/s2t/modules/initializer.py @@ -12,142 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. import numpy as np -from paddle.fluid import framework -from paddle.fluid import unique_name -from paddle.fluid.core import VarDesc -from paddle.fluid.initializer import MSRAInitializer - -__all__ = ['KaimingUniform'] - - -class KaimingUniform(MSRAInitializer): - r"""Implements the Kaiming Uniform initializer - - This class implements the weight initialization from the paper - `Delving Deep into Rectifiers: Surpassing Human-Level Performance on - ImageNet Classification `_ - by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a - robust initialization method that particularly considers the rectifier - nonlinearities. - - In case of Uniform distribution, the range is [-x, x], where - - .. math:: - - x = \sqrt{\frac{1.0}{fan\_in}} - - In case of Normal distribution, the mean is 0 and the standard deviation - is - - .. math:: - - \sqrt{\\frac{2.0}{fan\_in}} - - Args: - fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\ - inferred from the variable. default is None. - - Note: - It is recommended to set fan_in to None for most cases. - - Examples: - .. code-block:: python - - import paddle - import paddle.nn as nn - - linear = nn.Linear(2, - 4, - weight_attr=nn.initializer.KaimingUniform()) - data = paddle.rand([30, 10, 2], dtype='float32') - res = linear(data) - - """ - - def __init__(self, fan_in=None): - super(KaimingUniform, self).__init__( - uniform=True, fan_in=fan_in, seed=0) - - def __call__(self, var, block=None): - """Initialize the input tensor with MSRA initialization. - - Args: - var(Tensor): Tensor that needs to be initialized. - block(Block, optional): The block in which initialization ops - should be added. Used in static graph only, default None. 
- - Returns: - The initialization op - """ - block = self._check_block(block) - - assert isinstance(var, framework.Variable) - assert isinstance(block, framework.Block) - f_in, f_out = self._compute_fans(var) - - # If fan_in is passed, use it - fan_in = f_in if self._fan_in is None else self._fan_in - - if self._seed == 0: - self._seed = block.program.random_seed - - # to be compatible of fp16 initalizers - if var.dtype == VarDesc.VarType.FP16 or ( - var.dtype == VarDesc.VarType.BF16 and not self._uniform): - out_dtype = VarDesc.VarType.FP32 - out_var = block.create_var( - name=unique_name.generate( - ".".join(['masra_init', var.name, 'tmp'])), - shape=var.shape, - dtype=out_dtype, - type=VarDesc.VarType.LOD_TENSOR, - persistable=False) - else: - out_dtype = var.dtype - out_var = var - - if self._uniform: - limit = np.sqrt(1.0 / float(fan_in)) - op = block.append_op( - type="uniform_random", - inputs={}, - outputs={"Out": out_var}, - attrs={ - "shape": out_var.shape, - "dtype": int(out_dtype), - "min": -limit, - "max": limit, - "seed": self._seed - }, - stop_gradient=True) - - else: - std = np.sqrt(2.0 / float(fan_in)) - op = block.append_op( - type="gaussian_random", - outputs={"Out": out_var}, - attrs={ - "shape": out_var.shape, - "dtype": int(out_dtype), - "mean": 0.0, - "std": std, - "seed": self._seed - }, - stop_gradient=True) - - if var.dtype == VarDesc.VarType.FP16 or ( - var.dtype == VarDesc.VarType.BF16 and not self._uniform): - block.append_op( - type="cast", - inputs={"X": out_var}, - outputs={"Out": var}, - attrs={"in_dtype": out_var.dtype, - "out_dtype": var.dtype}) - - if not framework.in_dygraph_mode(): - var.op = op - return op - class DefaultInitializerContext(object): """ diff --git a/paddlespeech/server/bin/paddlespeech_client.py b/paddlespeech/server/bin/paddlespeech_client.py index e8e57fff052d7e67403b60474110b60422372ba5..96368c0f374d9922201f5968e414d52e54ca602d 100644 --- a/paddlespeech/server/bin/paddlespeech_client.py +++ b/paddlespeech/server/bin/paddlespeech_client.py @@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor): logger.info(f"the input audio: {input}") handler = VectorHttpHandler(server_ip=server_ip, port=port) res = handler.run(input, audio_format, sample_rate) + logger.info(f"The spk embedding is: {res}") return res elif task == "score": from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler diff --git a/paddlespeech/server/engine/acs/python/acs_engine.py b/paddlespeech/server/engine/acs/python/acs_engine.py index 63964a82550c754f58209a5c966da78110210b44..a607aa07a60834bbb289171ad6077f76a7686fff 100644 --- a/paddlespeech/server/engine/acs/python/acs_engine.py +++ b/paddlespeech/server/engine/acs/python/acs_engine.py @@ -192,12 +192,15 @@ class ACSEngine(BaseEngine): # search for each word in self.word_list offset = self.config.offset + # last time in time_stamp max_ed = time_stamp[-1]['ed'] for w in self.word_list: # search the w in asr_result and the index in asr_result + # https://docs.python.org/3/library/re.html#re.finditer for m in re.finditer(w, asr_result): + # match start and end char index in timestamp + # https://docs.python.org/3/library/re.html#re.Match.start start = max(time_stamp[m.start(0)]['bg'] - offset, 0) - end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed) logger.debug(f'start: {start}, end: {end}') acs_result.append({'w': w, 'bg': start, 'ed': end}) diff --git a/paddlespeech/server/engine/asr/online/ctc_endpoint.py b/paddlespeech/server/engine/asr/online/ctc_endpoint.py index 
2dba36417f130c0a081876ec3df8afaf98f9ff67..b87dbe8058f9eeca8f734eb1ade5c99f123b2473 100644 --- a/paddlespeech/server/engine/asr/online/ctc_endpoint.py +++ b/paddlespeech/server/engine/asr/online/ctc_endpoint.py @@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt: # rule1 times out after 5 seconds of silence, even if we decoded nothing. rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0) - # rule4 times out after 1.0 seconds of silence after decoding something, + # rule2 times out after 1.0 seconds of silence after decoding something, # even if we did not reach a final-state at all. rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0) - # rule5 times out after the utterance is 20 seconds long, regardless of + # rule3 times out after the utterance is 20 seconds long, regardless of # anything else. rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000) @@ -102,7 +102,8 @@ class OnlineCTCEndpoint: assert self.num_frames_decoded >= self.trailing_silence_frames assert self.frame_shift_in_ms > 0 - + + decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something utterance_length = self.num_frames_decoded * self.frame_shift_in_ms trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms diff --git a/paddlespeech/server/engine/asr/online/ctc_search.py b/paddlespeech/server/engine/asr/online/ctc_search.py index 46f310c80fe96fcbfc1d9dbc83b9f6f8161e24a1..06adb9cccc99505388184a32cbf3c72c9bfd9e12 100644 --- a/paddlespeech/server/engine/asr/online/ctc_search.py +++ b/paddlespeech/server/engine/asr/online/ctc_search.py @@ -83,11 +83,11 @@ class CTCPrefixBeamSearch: # cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score)) # 0. blank_ending_score, # 1. none_blank_ending_score, - # 2. viterbi_blank ending, - # 3. viterbi_non_blank, + # 2. viterbi_blank ending score, + # 3. viterbi_non_blank score, # 4. current_token_prob, - # 5. times_viterbi_blank, - # 6. times_titerbi_non_blank + # 5. times_viterbi_blank, times_b + # 6. 
times_titerbi_non_blank, times_nb if self.cur_hyps is None: self.cur_hyps = [(tuple(), (0.0, -float('inf'), 0.0, 0.0, -float('inf'), [], []))] @@ -106,69 +106,69 @@ class CTCPrefixBeamSearch: for s in top_k_index: s = s.item() ps = logp[s].item() - for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_s, - times_ns) in self.cur_hyps: + for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_b, + times_nb) in self.cur_hyps: last = prefix[-1] if len(prefix) > 0 else None if s == blank_id: # blank - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ prefix] n_pb = log_add([n_pb, pb + ps, pnb + ps]) - pre_times = times_s if v_b_s > v_nb_s else times_ns - n_times_s = copy.deepcopy(pre_times) + pre_times = times_b if v_b_s > v_nb_s else times_nb + n_times_b = copy.deepcopy(pre_times) viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s - n_v_s = viterbi_score + ps - next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + n_v_b = viterbi_score + ps + next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) elif s == last: # Update *ss -> *s; # case1: *a + a => *a - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ prefix] n_pnb = log_add([n_pnb, pnb + ps]) - if n_v_ns < v_nb_s + ps: - n_v_ns = v_nb_s + ps + if n_v_nb < v_nb_s + ps: + n_v_nb = v_nb_s + ps if n_cur_token_prob < ps: n_cur_token_prob = ps - n_times_ns = copy.deepcopy(times_ns) - n_times_ns[ + n_times_nb = copy.deepcopy(times_nb) + n_times_nb[ -1] = self.abs_time_step # 注意,这里要重新使用绝对时间 - next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) # Update *s-s -> *ss, - is for blank # Case 2: *aε + a => *aa n_prefix = prefix + (s, ) - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ n_prefix] - if n_v_ns < v_b_s + ps: - n_v_ns = v_b_s + ps + if n_v_nb < v_b_s + ps: + n_v_nb = v_b_s + ps n_cur_token_prob = ps - n_times_ns = copy.deepcopy(times_s) - n_times_ns.append(self.abs_time_step) + n_times_nb = copy.deepcopy(times_b) + n_times_nb.append(self.abs_time_step) n_pnb = log_add([n_pnb, pb + ps]) - next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) else: # Case 3: *a + b => *ab, *aε + b => *ab n_prefix = prefix + (s, ) - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ n_prefix] viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s - pre_times = times_s if v_b_s > v_nb_s else times_ns - if n_v_ns < viterbi_score + ps: - n_v_ns = viterbi_score + ps + pre_times = times_b if v_b_s > v_nb_s else times_nb + if n_v_nb < viterbi_score + ps: + n_v_nb = viterbi_score + ps n_cur_token_prob = ps - n_times_ns = copy.deepcopy(pre_times) - n_times_ns.append(self.abs_time_step) + n_times_nb = copy.deepcopy(pre_times) + n_times_nb.append(self.abs_time_step) n_pnb = log_add([n_pnb, pb + ps, pnb + ps]) - next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - 
n_cur_token_prob, n_times_s, - n_times_ns) + next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) # 2.2 Second beam prune next_hyps = sorted( diff --git a/paddlespeech/server/utils/onnx_infer.py b/paddlespeech/server/utils/onnx_infer.py index 23d83c735a7b1aa0853c37db8e231656712110f2..25802f627793ee99f12bdb97c5c309b1cf5573ca 100644 --- a/paddlespeech/server/utils/onnx_infer.py +++ b/paddlespeech/server/utils/onnx_infer.py @@ -30,7 +30,9 @@ def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None): # "gpu:0" providers = ['CPUExecutionProvider'] if "gpu" in sess_conf.get("device", ""): - providers = ['CUDAExecutionProvider'] + device_id = int(sess_conf["device"].split(":")[1]) + providers = [('CUDAExecutionProvider', {'device_id': device_id})] + # fastspeech2/mb_melgan can't use trt now! if sess_conf.get("use_trt", 0): providers = ['TensorrtExecutionProvider'] diff --git a/paddlespeech/t2s/exps/syn_utils.py b/paddlespeech/t2s/exps/syn_utils.py index cabea98972e147955317abf76aee069739eed49f..77abf97d9227f0a5df042a1a831d6cdc902bc499 100644 --- a/paddlespeech/t2s/exps/syn_utils.py +++ b/paddlespeech/t2s/exps/syn_utils.py @@ -29,6 +29,7 @@ from yacs.config import CfgNode from paddlespeech.t2s.datasets.data_table import DataTable from paddlespeech.t2s.frontend import English +from paddlespeech.t2s.frontend.mix_frontend import MixFrontend from paddlespeech.t2s.frontend.zh_frontend import Frontend from paddlespeech.t2s.modules.normalizer import ZScore from paddlespeech.utils.dynamic_import import dynamic_import @@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'): sentence = "".join(items[1:]) elif lang == 'en': sentence = " ".join(items[1:]) + elif lang == 'mix': + sentence = " ".join(items[1:]) sentences.append((utt_id, sentence)) return sentences @@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]], am_dataset = am[am.rindex('_') + 1:] if am_name == 'fastspeech2': fields = ["utt_id", "text"] - if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None: + if am_dataset in {"aishell3", "vctk", + "mix"} and speaker_dict is not None: print("multiple speaker fastspeech2!") fields += ["spk_id"] elif voice_cloning: @@ -140,6 +144,10 @@ def get_frontend(lang: str='zh', phone_vocab_path=phones_dict, tone_vocab_path=tones_dict) elif lang == 'en': frontend = English(phone_vocab_path=phones_dict) + elif lang == 'mix': + frontend = MixFrontend( + phone_vocab_path=phones_dict, tone_vocab_path=tones_dict) + else: print("wrong lang!") print("frontend done!") @@ -341,8 +349,12 @@ def get_am_output( input_ids = frontend.get_input_ids( input, merge_sentences=merge_sentences) phone_ids = input_ids["phone_ids"] + elif lang == 'mix': + input_ids = frontend.get_input_ids( + input, merge_sentences=merge_sentences) + phone_ids = input_ids["phone_ids"] else: - print("lang should in {'zh', 'en'}!") + print("lang should in {'zh', 'en', 'mix'}!") if get_tone_ids: tone_ids = input_ids["tone_ids"] diff --git a/paddlespeech/t2s/exps/synthesize_e2e.py b/paddlespeech/t2s/exps/synthesize_e2e.py index 28657eb27257f69f93589f65bcfdcfdd95ee52e8..ef9543296201f4d5b1dc4c5e59ae9a28cf956f27 100644 --- a/paddlespeech/t2s/exps/synthesize_e2e.py +++ b/paddlespeech/t2s/exps/synthesize_e2e.py @@ -113,8 +113,12 @@ def evaluate(args): input_ids = frontend.get_input_ids( sentence, merge_sentences=merge_sentences) phone_ids = input_ids["phone_ids"] + elif args.lang == 'mix': + input_ids = frontend.get_input_ids( + sentence, 
merge_sentences=merge_sentences) + phone_ids = input_ids["phone_ids"] else: - print("lang should in {'zh', 'en'}!") + print("lang should in {'zh', 'en', 'mix'}!") with paddle.no_grad(): flags = 0 for i in range(len(phone_ids)): @@ -122,7 +126,7 @@ def evaluate(args): # acoustic model if am_name == 'fastspeech2': # multi speaker - if am_dataset in {"aishell3", "vctk"}: + if am_dataset in {"aishell3", "vctk", "mix"}: spk_id = paddle.to_tensor(args.spk_id) mel = am_inference(part_phone_ids, spk_id) else: @@ -170,7 +174,7 @@ def parse_args(): choices=[ 'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc', 'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk', - 'tacotron2_csmsc', 'tacotron2_ljspeech' + 'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix' ], help='Choose acoustic model type of tts task.') parser.add_argument( @@ -231,7 +235,7 @@ def parse_args(): '--lang', type=str, default='zh', - help='Choose model language. zh or en') + help='Choose model language. zh or en or mix') parser.add_argument( "--inference_dir", diff --git a/paddlespeech/t2s/frontend/mix_frontend.py b/paddlespeech/t2s/frontend/mix_frontend.py new file mode 100644 index 0000000000000000000000000000000000000000..6386c871e138c745843f2e75a135fde7fdfd5ea2 --- /dev/null +++ b/paddlespeech/t2s/frontend/mix_frontend.py @@ -0,0 +1,179 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import re +from typing import Dict +from typing import List + +import paddle + +from paddlespeech.t2s.frontend import English +from paddlespeech.t2s.frontend.zh_frontend import Frontend + + +class MixFrontend(): + def __init__(self, + g2p_model="pypinyin", + phone_vocab_path=None, + tone_vocab_path=None): + + self.zh_frontend = Frontend( + phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path) + self.en_frontend = English(phone_vocab_path=phone_vocab_path) + self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)') + self.sp_id = self.zh_frontend.vocab_phones["sp"] + self.sp_id_tensor = paddle.to_tensor([self.sp_id]) + + def is_chinese(self, char): + if char >= '\u4e00' and char <= '\u9fa5': + return True + else: + return False + + def is_alphabet(self, char): + if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and + char <= '\u007a'): + return True + else: + return False + + def is_number(self, char): + if char >= '\u0030' and char <= '\u0039': + return True + else: + return False + + def is_other(self, char): + if not (self.is_chinese(char) or self.is_number(char) or + self.is_alphabet(char)): + return True + else: + return False + + def _split(self, text: str) -> List[str]: + text = re.sub(r'[《》【】<=>{}()()#&@“”^_|…\\]', '', text) + text = self.SENTENCE_SPLITOR.sub(r'\1\n', text) + text = text.strip() + sentences = [sentence.strip() for sentence in re.split(r'\n+', text)] + return sentences + + def _distinguish(self, text: str) -> List[str]: + # sentence --> [ch_part, en_part, ch_part, ...] 
+ + segments = [] + types = [] + + flag = 0 + temp_seg = "" + temp_lang = "" + + # Determine the type of each character. type: blank, chinese, alphabet, number, unk. + for ch in text: + if self.is_chinese(ch): + types.append("zh") + elif self.is_alphabet(ch): + types.append("en") + elif ch == " ": + types.append("blank") + elif self.is_number(ch): + types.append("num") + else: + types.append("unk") + + assert len(types) == len(text) + + for i in range(len(types)): + + # find the first char of the seg + if flag == 0: + if types[i] != "unk" and types[i] != "blank": + temp_seg += text[i] + temp_lang = types[i] + flag = 1 + + else: + if types[i] == temp_lang or types[i] == "num": + temp_seg += text[i] + + elif temp_lang == "num" and types[i] != "unk": + temp_seg += text[i] + if types[i] == "zh" or types[i] == "en": + temp_lang = types[i] + + elif temp_lang == "en" and types[i] == "blank": + temp_seg += text[i] + + elif types[i] == "unk": + pass + + else: + segments.append((temp_seg, temp_lang)) + + if types[i] != "unk" and types[i] != "blank": + temp_seg = text[i] + temp_lang = types[i] + flag = 1 + else: + flag = 0 + temp_seg = "" + temp_lang = "" + + segments.append((temp_seg, temp_lang)) + + return segments + + def get_input_ids(self, + sentence: str, + merge_sentences: bool=True, + get_tone_ids: bool=False, + add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]: + + sentences = self._split(sentence) + phones_list = [] + result = {} + + for text in sentences: + phones_seg = [] + segments = self._distinguish(text) + for seg in segments: + content = seg[0] + lang = seg[1] + if lang == "zh": + input_ids = self.zh_frontend.get_input_ids( + content, + merge_sentences=True, + get_tone_ids=get_tone_ids) + + elif lang == "en": + input_ids = self.en_frontend.get_input_ids( + content, merge_sentences=True) + + phones_seg.append(input_ids["phone_ids"][0]) + if add_sp: + phones_seg.append(self.sp_id_tensor) + + phones = paddle.concat(phones_seg) + phones_list.append(phones) + + if merge_sentences: + merge_list = paddle.concat(phones_list) + # rm the last 'sp' to avoid the noise at the end + # cause in the training data, no 'sp' in the end + if merge_list[-1] == self.sp_id_tensor: + merge_list = merge_list[:-1] + phones_list = [] + phones_list.append(merge_list) + + result["phone_ids"] = phones_list + + return result diff --git a/setup.py b/setup.py index 63e571b9e006edddb68ef3eafac833c101139d65..1cc82fa76f0ee4e7a404675b5dd35b84884b50da 100644 --- a/setup.py +++ b/setup.py @@ -50,6 +50,7 @@ base = [ "paddlespeech_feat", "Pillow>=9.0.0" "praatio==5.0.0", + "protobuf>=3.1.0, <=3.20.0", "pypinyin", "pypinyin-dict", "python-dateutil",
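
The most visible user-facing addition in this patch is the mixed Chinese/English front end (`MixFrontend` in `paddlespeech/t2s/frontend/mix_frontend.py`) together with the new `mix` language option wired through `syn_utils.py` and `synthesize_e2e.py`. As a rough usage sketch (not part of the patch itself), the new front end can be exercised directly; the dictionary path below is a hypothetical placeholder and should point at the phone id map shipped with the mixed zh/en acoustic model actually being deployed:

```python
# Illustrative sketch only (not part of this patch): drive the new MixFrontend
# directly. "phone_id_map.txt" is a hypothetical placeholder path; use the
# phones dict that ships with your mixed zh/en FastSpeech2 checkpoint.
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend

frontend = MixFrontend(phone_vocab_path="phone_id_map.txt")

# A mixed Chinese/English sentence, as enabled by the new `mix` language option.
sentence = "我们今天来测试一下 PaddleSpeech 的 mixed zh/en TTS frontend."

# get_input_ids returns {"phone_ids": [Tensor, ...]}; with merge_sentences=True
# the list holds one concatenated tensor, and the trailing "sp" token is
# stripped to match the training data.
result = frontend.get_input_ids(sentence, merge_sentences=True, add_sp=True)
phone_ids = result["phone_ids"][0]
print(phone_ids.shape)
```

End to end, the same path should be reachable from the command line via `synthesize_e2e.py --am fastspeech2_mix --lang mix`, since this patch adds `fastspeech2_mix` to the accepted acoustic model choices and routes `lang == 'mix'` inputs through the new front end.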