diff --git a/README_cn.md b/README_cn.md index 8f149d63ecf7bbf9774c308d50055d6f484218f0..896c575ce9b12cb71c21d3e49792d74d442d5c5a 100644 --- a/README_cn.md +++ b/README_cn.md @@ -20,7 +20,8 @@

- 快速开始
+ 安装
+ | 快速开始 | 快速使用服务 | 快速使用流式服务 | 教程文档
@@ -38,6 +39,8 @@
 **PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/),请访问 [arXiv](https://arxiv.org/abs/2205.12007) 论文。
+### 效果展示
+
 ##### 语音识别
@@ -154,7 +157,7 @@ 本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持。实现的功能涵盖训练、推断、测试模块以及部署过程,主要包括
 - 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。
 - 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。
-- 🏆 **流式ASR和TTS系统**:工业级的端到端流式识别、流式合成系统。
+- 🏆 **流式 ASR 和 TTS 系统**:工业级的端到端流式识别、流式合成系统。
 - 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。
 - **多种工业界以及学术界主流功能支持**:
   - 🛎️ 典型音频任务: 本工具包提供了音频分类、语音翻译、自动语音识别、文本转语音(语音合成)、声纹识别、KWS 等典型音频任务的实现。
@@ -182,61 +185,195 @@
+
 ## 安装
 我们强烈建议用户在 **Linux** 环境下,使用 *3.7* 及以上版本的 *python* 安装 PaddleSpeech。
-目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。
 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。
+
+### 相关依赖
++ gcc >= 4.8.5
++ paddlepaddle >= 2.3.1
++ python >= 3.7
++ linux(推荐), mac, windows
+
+PaddleSpeech 依赖于 paddlepaddle,安装可以参考 [paddlepaddle 官网](https://www.paddlepaddle.org.cn/),根据自己机器的情况选择合适的版本。这里给出 CPU 版本安装示例:
+
+```shell
+pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
+```
+
+PaddleSpeech 快速安装方式有两种:一种是 pip 安装,一种是源码编译(推荐)。
+
+### pip 安装
+```shell
+pip install pytest-runner
+pip install paddlespeech
+```
+
+### 源码编译
+```shell
+git clone https://github.com/PaddlePaddle/PaddleSpeech.git
+cd PaddleSpeech
+pip install pytest-runner
+pip install .
+```
+
+更多安装问题,如 conda 环境、librosa 依赖的系统库、gcc 环境问题、kaldi 安装等,可以参考这篇[安装文档](docs/source/install_cn.md);如安装时遇到问题,可以在 [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) 留言并查找相关问题。

 ## 快速开始
-安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。
+安装完成后,开发者可以通过命令行或者 Python 快速开始,命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持 16k wav 格式音频。
+
+你也可以在 `aistudio` 中快速体验 👉🏻[PaddleSpeech API Demo](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)。
-**声音分类**
+测试音频示例下载
 ```shell
-paddlespeech cls --input input.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
 ```
-**声纹识别**
+
+### 语音识别
+
 (点击可展开)开源中文语音识别 + +命令行一键体验 + ```shell -paddlespeech vector --task spk --input input_16k.wav +paddlespeech asr --lang zh --input zh.wav +``` + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.asr.infer import ASRExecutor +>>> asr = ASRExecutor() +>>> result = asr(audio_file="zh.wav") +>>> print(result) +我认为跑步最重要的就是给我带来了身体健康 ``` -**语音识别** +
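+`ASRExecutor` 在首次调用时会下载并加载预训练模型,之后的调用会复用同一个模型实例。下面是一个批量识别多个音频文件的简单示意(其中 `zh2.wav` 仅为占位的示例文件名,实际使用时替换成自己的 16k wav 文件即可):
+
+```python
+from paddlespeech.cli.asr.infer import ASRExecutor
+
+asr = ASRExecutor()
+
+# 模型只在首次调用时加载,循环中的后续调用直接复用,适合批量处理
+for name in ["zh.wav", "zh2.wav"]:
+    print(name, "->", asr(audio_file=name))
+```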
+ +### 语音合成 + +
 开源中文语音合成
+
+输出 24k 采样率 wav 格式音频
+
+命令行一键体验
+
 ```shell
-paddlespeech asr --lang zh --input input_16k.wav
+paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
+```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.tts.infer import TTSExecutor
+>>> tts = TTSExecutor()
+>>> tts(text="今天天气十分不错。", output="output.wav")
 ```
-**语音翻译** (English to Chinese)
+- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces)。请参考:[TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
+
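+上文提到输出为 24k 采样率的 wav 文件,可以用 Python 标准库 `wave` 做一个简单验证(示意代码,假设 `output.wav` 已由上面的命令生成):
+
+```python
+import wave
+
+# 检查合成音频的采样率与时长,采样率预期为 24000(24k)
+with wave.open("output.wav") as f:
+    print(f.getframerate(), f.getnframes() / f.getframerate())
+```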
+ +### 声音分类 + +
 适配多场景的开放领域声音分类工具
+
+基于 AudioSet 数据集 527 个类别的声音分类模型
+
+命令行一键体验
+
 ```shell
-paddlespeech st --input input_16k.wav
+paddlespeech cls --input zh.wav
 ```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.cls.infer import CLSExecutor
+>>> cls = CLSExecutor()
+>>> result = cls(audio_file="zh.wav")
+>>> print(result)
+Speech 0.9027186632156372
+```
+
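+除默认的 top-1 结果外,一个常见需求是查看多个候选类别。下面的写法仅是一个示意:假设 `CLSExecutor` 与服务端客户端一样支持 `topk` 参数(具体以 `paddlespeech cls --help` 的参数列表为准):
+
+```python
+from paddlespeech.cli.cls.infer import CLSExecutor
+
+cls = CLSExecutor()
+# topk 为假设存在的参数:返回得分最高的 3 个类别
+print(cls(audio_file="zh.wav", topk=3))
+```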
+ +### 声纹提取 + +
 工业级声纹提取工具 + +命令行一键体验 + ```shell -paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav +paddlespeech vector --task spk --input zh.wav ``` -- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech) -**文本后处理** - - 标点恢复 - ```bash - paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 - ``` +Python API 一键预测 -**批处理** +```python +>>> from paddlespeech.cli.vector import VectorExecutor +>>> vec = VectorExecutor() +>>> result = vec(audio_file="zh.wav") +>>> print(result) # 187维向量 +[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729 + 4.9295826 1.4780062 0.3733844 10.695862 3.2697146 + -4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...] ``` -echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts + +
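+提取出的声纹向量可以直接用来计算两段音频的说话人相似度。下面用 `numpy` 做余弦打分(示意代码,`en.wav` 来自上文下载的测试音频;从上面的打印结果看,返回值可按 numpy 数组处理):
+
+```python
+import numpy as np
+
+from paddlespeech.cli.vector import VectorExecutor
+
+vec = VectorExecutor()
+emb1 = vec(audio_file="zh.wav")
+emb2 = vec(audio_file="en.wav")
+
+# 余弦相似度:值越接近 1,两段音频越可能来自同一说话人
+score = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
+print(score)
+```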
+ +### 标点恢复 + +
 一键恢复文本标点,可与 ASR 模型配合使用
+
+命令行一键体验
+
+```shell
+paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
+```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.text.infer import TextExecutor
+>>> text_punc = TextExecutor()
+>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
+>>> print(result)
+今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
 ```
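+标点恢复通常与语音识别配合使用:先用 `ASRExecutor` 得到无标点的识别文本,再交给 `TextExecutor` 补全标点(示意代码,沿用上文的 zh.wav):
+
+```python
+from paddlespeech.cli.asr.infer import ASRExecutor
+from paddlespeech.cli.text.infer import TextExecutor
+
+asr = ASRExecutor()
+text_punc = TextExecutor()
+
+# 先识别,再为识别结果恢复标点
+raw_text = asr(audio_file="zh.wav")
+print(text_punc(text=raw_text))
+```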
+ +### 语音翻译 + +
 端到端英译中语音翻译工具
+
+使用预编译的 kaldi 相关工具,只支持在 Ubuntu 系统中体验
+
+命令行一键体验
+
+```shell
+paddlespeech st --input en.wav
 ```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.st.infer import STExecutor
+>>> st = STExecutor()
+>>> result = st(audio_file="en.wav")
+>>> print(result)
+['我 在 这栋 建筑 的 古老 门上 敲门 。']
 ```
-更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
-> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md), [语音合成](./docs/source/tts/quick_start.md)。
+
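+返回结果是一个列表,每个元素对应一条翻译,且中文分词之间以空格分隔;如需得到连续的句子,可以自行去掉空格(示意代码,沿用上面的 `result`):
+
+```python
+>>> "".join(result[0].split())
+'我在这栋建筑的古老门上敲门。'
+```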
+
+
 ## 快速使用服务
-安装完成后,开发者可以通过命令行快速使用服务。
+安装完成后,开发者可以通过命令行一键启动语音识别、语音合成、音频分类三种服务。

 **启动服务**
 ```shell
@@ -614,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
 语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。
+
 ## ⭐ 应用案例
 - **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**

diff --git a/demos/speech_server/README.md b/demos/speech_server/README.md
index 34f927fa49cfb3fe15ce1476392494ce61002a7f..fda8ab9c407a60d209f799a995a64a2127c4839b 100644
--- a/demos/speech_server/README.md
+++ b/demos/speech_server/README.md
@@ -5,14 +5,19 @@
 ## Introduction
 This demo is an implementation of starting the voice service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in Python.

+For the service interface definition, please check:
+- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
+
 ## Usage
 ### 1. Installation
 see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

-It is recommended to use **paddlepaddle 2.2.2** or above.
+It is recommended to use **paddlepaddle 2.3.1** or above.
+
 You can choose one way from easy, medium and hard to install paddlespeech.

-**If you install in simple mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
+
+**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**

 ### 2. Prepare config File
 The configuration file can be found in `conf/application.yaml`.
@@ -47,7 +52,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 - `log_file`: log file. Default: ./log/paddlespeech.log

 Output:
- ```bash
+ ```text
 [2022-02-23 11:17:32] [INFO] [server.py:64] Started server process [6384]
 INFO: Waiting for application startup.
 [2022-02-23 11:17:32] [INFO] [on.py:26] Waiting for application startup.
@@ -55,7 +60,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 [2022-02-23 11:17:32] [INFO] [on.py:38] Application startup complete.
 INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
 [2022-02-23 11:17:32] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
-
 ```

 - Python API
@@ -69,7 +73,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 Output:
- ```bash
+ ```text
 INFO: Started server process [529]
 [2022-02-23 14:57:56] [INFO] [server.py:64] Started server process [529]
 INFO: Waiting for application startup.
@@ -78,7 +82,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 [2022-02-23 14:57:56] [INFO] [on.py:38] Application startup complete.
 INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
 [2022-02-23 14:57:56] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
-
 ```
@@ -106,7 +109,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 - `audio_format`: Audio format. Default: "wav".
 Output:
- ```bash
+ ```text
 [2022-02-23 18:11:22,819] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
 [2022-02-23 18:11:22,820] [ INFO] - time cost 0.689145 s.
@@ -129,7 +132,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 Output:
- ```bash
+ ```text
 {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
 ```
@@ -158,12 +161,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 - `output`: Output wave filepath. Default: None, which means the audio will not be saved locally.

 Output:
- ```bash
+ ```text
 [2022-02-23 15:20:37,875] [ INFO] - {'description': 'success.'}
 [2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav.
 [2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s.
 [2022-02-23 15:20:37,875] [ INFO] - Response time: 0.348050 s.
-
 ```

 - Python API
@@ -189,11 +191,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 Output:
- ```bash
+ ```text
 {'description': 'success.'}
 Save synthesized audio successfully on ./output.wav.
 Audio duration: 3.612500 s.
-
 ```

 ### 6. CLS Client Usage
@@ -202,7 +203,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 If `127.0.0.1` is not accessible, you need to use the actual service IP address.

- ```
+ ```bash
 paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
 ```
@@ -218,11 +219,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 - `topk`: topk scores of the classification result.

 Output:
- ```bash
+ ```text
 [2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
 [2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
- - ``` - Python API @@ -240,9 +239,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` Output: - ```bash + ```text {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} - ``` @@ -274,13 +272,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee Output: - ```bash - [2022-05-25 12:25:36,165] [ INFO] - vector http client start - [2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav - [2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, 
-0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} - [2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s. + ```text + [2022-05-25 12:25:36,165] [ INFO] - vector http client start + [2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav + [2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector + [2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector + [2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, 
-4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} + [2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s. 
``` * Python API @@ -299,8 +297,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee Output: - ``` bash - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, 
-13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} + ```text + {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, 
-7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} ``` #### 7.2 Get the score between speaker audio embedding @@ -331,13 +329,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee Output: - ``` bash - [2022-05-25 12:33:24,527] [ INFO] - vector score http client start - [2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav - [2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score - [2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s. + ```text + [2022-05-25 12:33:24,527] [ INFO] - vector score http client start + [2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav + [2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score + [2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} + [2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} + [2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s. ``` * Python API @@ -358,7 +356,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee Output: - ``` bash + ```text [2022-05-25 12:30:14,143] [ INFO] - vector score http client start [2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav [2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score @@ -389,9 +387,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `input`(required): Input text to get punctuation. Output: - ```bash - [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 - [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. + ```text + [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 + [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. 
 ```
 - Python API
@@ -408,11 +406,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 Output:
- ```bash
+ ```text
 我认为跑步最重要的就是给我带来了身体健康。
 ```
-
 ## Models supported by the service
 ### ASR model
 Get all models supported by the ASR service via `paddlespeech_server stats --task asr`, where static models can be used for paddle inference.

diff --git a/demos/speech_server/README_cn.md b/demos/speech_server/README_cn.md
index bcbb2cbe2764f1fef2c6d925620ee337b7606716..247bde4646c272215e39ecfc4218cc518a7b8cd9 100644
--- a/demos/speech_server/README_cn.md
+++ b/demos/speech_server/README_cn.md
@@ -3,22 +3,29 @@
 # 语音服务

 ## 介绍
-这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用`paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
+这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 Python 的几行代码来实现。
+
+服务接口定义请参考:
+- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)

 ## 使用方法
 ### 1. 安装
 请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。

-推荐使用 **paddlepaddle 2.2.2** 或以上版本。
-你可以从简单,中等,困难几种方式中选择一种方式安装 PaddleSpeech。
-**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
+推荐使用 **paddlepaddle 2.3.1** 或以上版本。
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**

 ### 2. 准备配置文件
 配置文件可参见 `conf/application.yaml`。
-其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
-目前服务集成的语音任务有: asr(语音识别)、tts(语音合成)、cls(音频分类)、vector(声纹识别)以及 text(文本处理)。
+其中,`engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
+
+目前服务集成的语音任务有:asr(语音识别)、tts(语音合成)、cls(音频分类)、vector(声纹识别)以及 text(文本处理)。
+
 目前引擎类型支持两种形式:python 及 inference (Paddle Inference)。
 **注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。
@@ -48,7 +55,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 - `log_file`: log 文件。默认:./log/paddlespeech.log

 输出:
- ```bash
+ ```text
 [2022-02-23 11:17:32] [INFO] [server.py:64] Started server process [6384]
 INFO: Waiting for application startup.
 [2022-02-23 11:17:32] [INFO] [on.py:26] Waiting for application startup.
@@ -56,7 +63,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 [2022-02-23 11:17:32] [INFO] [on.py:38] Application startup complete.
 INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
 [2022-02-23 11:17:32] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
-
 ```

 - Python API
@@ -70,7 +76,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 ```
 输出:
- ```bash
+ ```text
 INFO: Started server process [529]
 [2022-02-23 14:57:56] [INFO] [server.py:64] Started server process [529]
 INFO: Waiting for application startup.
@@ -79,7 +85,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 [2022-02-23 14:57:56] [INFO] [on.py:38] Application startup complete.
 INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
 [2022-02-23 14:57:56] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
-
 ```

 ### 4. 
ASR 客户端使用方法 @@ -108,8 +113,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `audio_format`: 音频格式,默认值:wav。 输出: - - ```bash + ```text [2022-02-23 18:11:22,819] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}} [2022-02-23 18:11:22,820] [ INFO] - time cost 0.689145 s. ``` @@ -131,9 +135,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` 输出: - ```bash + ```text {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}} - ``` ### 5. TTS 客户端使用方法 @@ -162,7 +165,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `output`: 输出音频的路径, 默认值:None,表示不保存音频到本地。 输出: - ```bash + ```text [2022-02-23 15:20:37,875] [ INFO] - {'description': 'success.'} [2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav. [2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s. @@ -192,11 +195,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` 输出: - ```bash + ```text {'description': 'success.'} Save synthesized audio successfully on ./output.wav. Audio duration: 3.612500 s. - ``` ### 6. CLS 客户端使用方法 @@ -207,7 +209,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` + ```bash paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav ``` @@ -223,11 +225,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `topk`: 分类结果的topk。 输出: - ```bash + ```text [2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} [2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s. - - ``` - Python API @@ -242,13 +242,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee port=8090, topk=1) print(res.json()) - ``` 输出: - ```bash + ```text {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} - ``` ### 7. 
声纹客户端使用方法 @@ -259,7 +257,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` bash + ```bash paddlespeech_client vector --task spk --server_ip 127.0.0.1 --port 8090 --input 85236145389.wav ``` @@ -275,15 +273,15 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee * task: vector 的任务,可选spk或者score。默认是 spk。 * enroll: 注册音频;。 * test: 测试音频。 - 输出: - ``` bash - [2022-05-25 12:25:36,165] [ INFO] - vector http client start - [2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav - [2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, 
-0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} - [2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s. + 输出: + ```text + [2022-05-25 12:25:36,165] [ INFO] - vector http client start + [2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav + [2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector + [2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector + [2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, 
-3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} + [2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s. 
``` * Python API @@ -301,9 +299,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` 输出: - - ``` bash - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, 
-13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} + ```text + {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, 
-7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} ``` #### 7.2 音频声纹打分 @@ -332,19 +329,18 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee * test: 测试音频。 输出: - - ``` bash - [2022-05-25 12:33:24,527] [ INFO] - vector score http client start - [2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav - [2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score - [2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s. + ```text + [2022-05-25 12:33:24,527] [ INFO] - vector score http client start + [2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav + [2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score + [2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} + [2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} + [2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s. ``` * Python API - ``` python + ```python from paddlespeech.server.bin.paddlespeech_client import VectorClientExecutor vectorclient_executor = VectorClientExecutor() @@ -359,8 +355,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` 输出: - - ``` bash + ```text [2022-05-25 12:30:14,143] [ INFO] - vector score http client start [2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav [2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score @@ -368,7 +363,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} ``` - ### 8. 标点预测 **注意:** 初次使用客户端时响应时间会略长 @@ -391,9 +385,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `input`(必须输入): 用于标点预测的文本内容。 输出: - ```bash - [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 - [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. 
+ ```text + [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 + [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. ``` - Python API @@ -406,11 +400,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090,) print(res) - ``` 输出: - ```bash + ```text 我认为跑步最重要的就是给我带来了身体健康。 ``` @@ -419,10 +412,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee 通过 `paddlespeech_server stats --task asr` 获取 ASR 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 ### TTS 支持的模型 -通过 `paddlespeech_server stats --task tts` 获取 TTS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 +通过 `paddlespeech_server stats --task tts` 获取 TTS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 ### CLS 支持的模型 -通过 `paddlespeech_server stats --task cls` 获取 CLS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 +通过 `paddlespeech_server stats --task cls` 获取 CLS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 ### Vector 支持的模型 通过 `paddlespeech_server stats --task vector` 获取 Vector 服务支持的所有模型。 diff --git "a/demos/speech_web/\346\216\245\345\217\243\346\226\207\346\241\243.md" b/demos/speech_web/API.md similarity index 82% rename from "demos/speech_web/\346\216\245\345\217\243\346\226\207\346\241\243.md" rename to demos/speech_web/API.md index a811a3f4e55284640704941cd10b888f1b0373e6..c5144674940a0170ab380c5477c0a32d998ad2d5 100644 --- "a/demos/speech_web/\346\216\245\345\217\243\346\226\207\346\241\243.md" +++ b/demos/speech_web/API.md @@ -8,7 +8,7 @@ http://0.0.0.0:8010/docs ### 【POST】/asr/offline -说明:上传16k,16bit wav文件,返回 offline 语音识别模型识别结果 +说明:上传 16k, 16bit wav 文件,返回 offline 语音识别模型识别结果 返回: JSON @@ -26,11 +26,11 @@ http://0.0.0.0:8010/docs ### 【POST】/asr/offlinefile -说明:上传16k,16bit wav文件,返回 offline 语音识别模型识别结果 + wav数据的base64 +说明:上传16k,16bit wav文件,返回 offline 语音识别模型识别结果 + wav 数据的 base64 返回: JSON -前端接口: 音频文件识别(播放这段base64还原后记得添加wav头,采样率16k, int16,添加后才能播放) +前端接口: 音频文件识别(播放这段base64还原后记得添加 wav 头,采样率 16k, int16,添加后才能播放) 示例: @@ -48,7 +48,7 @@ http://0.0.0.0:8010/docs ### 【POST】/asr/collectEnv -说明: 通过采集环境噪音,上传16k, int16 wav文件,来生成后台VAD的能量阈值, 返回阈值结果 +说明: 通过采集环境噪音,上传 16k, int16 wav 文件,来生成后台 VAD 的能量阈值, 返回阈值结果 前端接口:ASR-环境采样 @@ -64,9 +64,9 @@ http://0.0.0.0:8010/docs ### 【GET】/asr/stopRecord -说明:通过 GET 请求 /asr/stopRecord, 后台停止接收 offlineStream 中通过 WS协议 上传的数据 +说明:通过 GET 请求 /asr/stopRecord, 后台停止接收 offlineStream 中通过 WS 协议 上传的数据 -前端接口:语音聊天-暂停录音(获取NLP,播放TTS时暂停) +前端接口:语音聊天-暂停录音(获取 NLP,播放 TTS 时暂停) 返回: JSON @@ -80,9 +80,9 @@ http://0.0.0.0:8010/docs ### 【GET】/asr/resumeRecord -说明:通过 GET 请求 /asr/resumeRecord, 后台停止接收 offlineStream 中通过 WS协议 上传的数据 +说明:通过 GET 请求 /asr/resumeRecord, 后台停止接收 offlineStream 中通过 WS 协议 上传的数据 -前端接口:语音聊天-恢复录音(TTS播放完毕时,告诉后台恢复录音) +前端接口:语音聊天-恢复录音( TTS 播放完毕时,告诉后台恢复录音) 返回: JSON @@ -100,16 +100,16 @@ http://0.0.0.0:8010/docs 前端接口:语音聊天-开始录音,持续将麦克风语音传给后端,后端推送语音识别结果 -返回:后端返回识别结果,offline模型识别结果, 由WS推送 +返回:后端返回识别结果,offline 模型识别结果, 由WS推送 ### 【Websocket】/ws/asr/onlineStream -说明:通过 WS 协议,将前端音频持续上传到后台,前端采集 16k,Int16 类型的PCM片段,持续上传到后端 +说明:通过 WS 协议,将前端音频持续上传到后台,前端采集 16k,Int16 类型的 PCM 片段,持续上传到后端 前端接口:ASR-流式识别开始录音,持续将麦克风语音传给后端,后端推送语音识别结果 -返回:后端返回识别结果,online模型识别结果, 由WS推送 +返回:后端返回识别结果,online 模型识别结果, 由 WS 推送 ## NLP @@ -202,7 +202,7 @@ http://0.0.0.0:8010/docs ### 【POST】/tts/offline -说明:获取TTS离线模型音频 +说明:获取 TTS 离线模型音频 前端接口:TTS-端到端合成 @@ -272,7 +272,7 @@ curl -X 'POST' \ ### 【POST】/vpr/recog -说明:声纹识别,识别文件,提取文件的声纹信息做比对 音频 16k, int 16 wav格式 +说明:声纹识别,识别文件,提取文件的声纹信息做比对 音频 16k, int 16 wav 格式 前端接口:声纹识别-上传音频,返回声纹识别结果 @@ -383,9 +383,9 @@ curl -X 'GET' \ ### 【GET】/vpr/database64 -说明: 根据 vpr_id 
获取用户vpr时注册使用音频转换成 16k, int16 类型的数组,返回base64编码 +说明: 根据 vpr_id 获取用户 vpr 时注册使用音频转换成 16k, int16 类型的数组,返回 base64 编码 -前端接口:声纹识别-获取vpr对应的音频(注意:播放时需要添加 wav头,16k,int16, 可参考tts播放时添加wav的方式,注意更改采样率) +前端接口:声纹识别-获取 vpr 对应的音频(注意:播放时需要添加 wav头,16k,int16, 可参考 tts 播放时添加 wav 的方式,注意更改采样率) 访问示例: @@ -401,6 +401,4 @@ curl -X 'GET' \ "code": 0, "result":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", "message": "ok" -``` - - +``` \ No newline at end of file diff --git a/demos/speech_web/README_cn.md b/demos/speech_web/README.md similarity index 63% rename from demos/speech_web/README_cn.md rename to demos/speech_web/README.md index eb90bb39ce7998c98eadc9603a49e78c1d82d561..3b2da6e9a52fa6e93623fd52580c1ca090642f4d 100644 --- a/demos/speech_web/README_cn.md +++ b/demos/speech_web/README.md @@ -1,16 +1,16 @@ # Paddle Speech Demo -PaddleSpeechDemo是一个以PaddleSpeech的语音交互功能为主体开发的Demo展示项目,用于帮助大家更好的上手PaddleSpeech以及使用PaddleSpeech构建自己的应用。 +PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的 Demo 展示项目,用于帮助大家更好的上手 PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。 -智能语音交互部分使用PaddleSpeech,对话以及信息抽取部分使用PaddleNLP,网页前端展示部分基于Vue3进行开发 +智能语音交互部分使用 PaddleSpeech,对话以及信息抽取部分使用 PaddleNLP,网页前端展示部分基于 Vue3 进行开发 主要功能: -+ 语音聊天:PaddleSpeech的语音识别能力+语音合成能力,对话部分基于PaddleNLP的闲聊功能 -+ 声纹识别:PaddleSpeech的声纹识别功能展示 ++ 语音聊天:PaddleSpeech 的语音识别能力+语音合成能力,对话部分基于 PaddleNLP 的闲聊功能 ++ 声纹识别:PaddleSpeech 的声纹识别功能展示 + 语音识别:支持【实时语音识别】,【端到端识别】,【音频文件识别】三种模式 + 语音合成:支持【流式合成】与【端到端合成】两种方式 -+ 语音指令:基于PaddleSpeech的语音识别能力与PaddleNLP的信息抽取,实现交通费的智能报销 ++ 语音指令:基于 PaddleSpeech 的语音识别能力与 PaddleNLP 的信息抽取,实现交通费的智能报销 运行效果: @@ -32,23 +32,21 @@ cd model wget https://bj.bcebos.com/paddlenlp/applications/speech-cmd-analysis/finetune/model_state.pdparams ``` - ### 前端环境安装 -前端依赖node.js ,需要提前安装,确保npm可用,npm测试版本8.3.1,建议下载[官网](https://nodejs.org/en/)稳定版的node.js +前端依赖 `node.js` ,需要提前安装,确保 `npm` 可用,`npm` 测试版本 `8.3.1`,建议下载[官网](https://nodejs.org/en/)稳定版的 `node.js` ``` # 进入前端目录 cd web_client -# 安装yarn,已经安装可跳过 +# 安装 `yarn`,已经安装可跳过 npm install -g yarn # 使用yarn安装前端依赖 yarn install ``` - ## 启动服务 ### 开启后端服务 @@ -66,18 +64,18 @@ cd web_client yarn dev --port 8011 ``` -默认配置下,前端中配置的后台地址信息是localhost,确保后端服务器和打开页面的游览器在同一台机器上,不在一台机器的配置方式见下方的FAQ:【后端如果部署在其它机器或者别的端口如何修改】 +默认配置下,前端中配置的后台地址信息是 localhost,确保后端服务器和打开页面的游览器在同一台机器上,不在一台机器的配置方式见下方的 FAQ:【后端如果部署在其它机器或者别的端口如何修改】 ## FAQ #### Q: 如何安装node.js -A: node.js的安装可以参考[【菜鸟教程】](https://www.runoob.com/nodejs/nodejs-install-setup.html), 确保npm可用 +A: node.js的安装可以参考[【菜鸟教程】](https://www.runoob.com/nodejs/nodejs-install-setup.html), 确保 npm 可用 #### Q:后端如果部署在其它机器或者别的端口如何修改 A:后端的配置地址有分散在两个文件中 -修改第一个文件`PaddleSpeechWebClient/vite.config.js` +修改第一个文件 `PaddleSpeechWebClient/vite.config.js` ``` server: { @@ -92,7 +90,7 @@ server: { } ``` -修改第二个文件`PaddleSpeechWebClient/src/api/API.js`(Websocket代理配置失败,所以需要在这个文件中修改) +修改第二个文件 `PaddleSpeechWebClient/src/api/API.js`( Websocket 代理配置失败,所以需要在这个文件中修改) ``` // websocket (这里改成后端所在的接口) @@ -107,9 +105,6 @@ A:这里主要是游览器安全策略的限制,需要配置游览器后重 chrome设置地址: chrome://flags/#unsafely-treat-insecure-origin-as-secure - - - ## 参考资料 vue实现录音参考资料:https://blog.csdn.net/qq_41619796/article/details/107865602#t1 diff --git a/demos/streaming_asr_server/README.md b/demos/streaming_asr_server/README.md index 773d4957c0f8504a443b26f1ad2e21622e2dd1f7..a974867579363cd5f2ee1433414747a432883f7a 100644 --- a/demos/streaming_asr_server/README.md +++ b/demos/streaming_asr_server/README.md @@ -7,13 +7,18 @@ This demo is an implementation of starting the streaming speech service and acce Streaming ASR server only support `websocket` protocol, and doesn't support `http` protocol. 
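+The client logs later in this document reflect the following message flow: the client opens a websocket, sends a start handshake, waits for `{"status": "ok", "signal": "server_ready"}`, streams small 16 kHz 16-bit PCM chunks while receiving partial `{"result": ...}` messages, and finally receives a message with `"signal": "finished"`. The sketch below illustrates that flow with the `websockets` package. It is a minimal sketch, not the official client: the handshake fields (`name`, `signal`, `nbest`) and the chunk size are assumptions here, and `paddlespeech_client asr_online` (shown later in this document) remains the supported way to access the service.
+
+```python
+# Minimal sketch of a streaming ASR websocket client.
+# The JSON handshake fields and the chunk size are assumptions, not the
+# documented protocol; see the API reference linked below.
+import asyncio
+import json
+import wave
+
+import websockets
+
+
+async def stream_wav(path, url="ws://127.0.0.1:8090/paddlespeech/asr/streaming"):
+    with wave.open(path, "rb") as f:
+        pcm = f.readframes(f.getnframes())  # expects 16 kHz, 16-bit mono PCM
+    async with websockets.connect(url) as ws:
+        await ws.send(json.dumps({"name": path, "signal": "start", "nbest": 1}))
+        print(await ws.recv())  # expect {"status": "ok", "signal": "server_ready"}
+        chunk = 8000  # bytes per message, i.e. 0.25 s of 16 kHz 16-bit audio (assumed)
+        for i in range(0, len(pcm), chunk):
+            await ws.send(pcm[i:i + chunk])
+            print(await ws.recv())  # partial result: {"result": "..."}
+        await ws.send(json.dumps({"name": path, "signal": "end", "nbest": 1}))
+        print(await ws.recv())  # final result with "signal": "finished"
+
+
+asyncio.run(stream_wav("zh.wav"))
+```
+
+Note that this sketch reads one reply per chunk for simplicity; the official client overlaps sending and receiving instead of this strict lockstep.
+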
+For the service interface definition, please refer to:
+- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
+
 ## Usage
 ### 1. Installation
 see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
-It is recommended to use **paddlepaddle 2.2.1** or above.
+It is recommended to use **paddlepaddle 2.3.1** or above.
+
 You can choose one way from easy, medium and hard to install paddlespeech.
-**If you install in simple mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
+
+**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**

### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` and `conf/ws_conformer_wenetspeech_application.yaml`.
@@ -48,28 +53,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
 - `log_file`: log file. Default: `./log/paddlespeech.log`

 Output:
-  ```bash
-  [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance
-  [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu
-  [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k
-  [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking...
-  [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar
-  [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar
-  [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml
-  [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams
-  [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams
-  [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine
-  [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online
-  [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success
-  [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully.
-  INFO: Started server process [4242]
-  [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242]
-  INFO: Waiting for application startup.
-  [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup.
-  INFO: Application startup complete.
-  [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete.
- INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` - Python API @@ -85,28 +90,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` Output: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 
0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. 
+ [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` @@ -117,7 +122,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav If `127.0.0.1` is not accessible, you need to use the actual service IP address. - ``` + ```bash paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav ``` @@ -126,6 +131,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ```bash paddlespeech_client asr_online --help ``` + Arguments: - `server_ip`: server ip. Default: 127.0.0.1 - `port`: server port. Default: 8090 @@ -137,75 +143,74 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `punc.server_port`: punctuation server port. Default: None. Output: - ```bash - [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client - [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start - [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming - [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav - [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} - [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,532] [ INFO] - client receive 
msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 
'ed': 4.96}]} - [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 - [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 - + ```text + [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client + [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start + [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming + [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav + [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} + [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,532] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': 
'我认为跑步最重要的就是'} + [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]} + [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 + [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - Python API @@ -224,7 +229,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` Output: - ```bash + ```text [2022-05-06 21:14:03,137] [ INFO] - asr websocket client start [2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming [2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} @@ -299,12 +304,11 @@ wget -c 
https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Command Line **Note:** The default deployment of the server is on the 'CPU' device, which can be deployed on the 'GPU' by modifying the 'device' parameter in the service configuration file. - ``` bash + ```bash In PaddleSpeech/demos/streaming_asr_server directory to lanuch punctuation service paddlespeech_server start --config_file conf/punc_application.yaml ``` - Usage: ```bash paddlespeech_server start --help @@ -316,7 +320,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav Output: - ``` bash + ```text [2022-05-02 17:59:26,285] [ INFO] - Create the TextEngine Instance [2022-05-02 17:59:26,285] [ INFO] - Init the text engine [2022-05-02 17:59:26,285] [ INFO] - Text Engine set the device: gpu:0 @@ -348,26 +352,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav log_file="./log/paddlespeech.log") ``` - Output: - ``` - [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance - [2022-05-02 18:09:02,543] [ INFO] - Init the text engine - [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 - [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... - [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar - W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 - W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. - [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt - [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully - INFO: Started server process [22615] - [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] - INFO: Waiting for application startup. - [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - ``` + Output: + ```text + [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance + [2022-05-02 18:09:02,543] [ INFO] - Init the text engine + [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 + [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... + [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar + W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 + W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. 
+  [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt
+  [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully
+  INFO: Started server process [22615]
+  [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615]
+  INFO: Waiting for application startup.
+  [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup.
+  INFO: Application startup complete.
+  [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete.
+  INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit)
+  [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit)
+  ```

### 2. Client usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line

   If `127.0.0.1` is not accessible, you need to use the actual service IP address.

-   ```
+   ```bash
   paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康"
   ```

   Output:
-   ```
+   ```text
   [2022-05-02 18:12:29,767] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。
   [2022-05-02 18:12:29,767] [ INFO] - Response time 0.096548 s.
   ```

-- Python3 API
+- Python API

   ```python
   from paddlespeech.server.bin.paddlespeech_client import TextClientExecutor
@@ -400,11 +404,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
   ```

   Output:
-   ``` bash
+   ```text
   我认为跑步最重要的就是给我带来了身体健康。
   ```

-
## Join streaming asr and punctuation server

By default, each server is deployed on the 'CPU' device and speech recognition and punctuation prediction can be deployed on different 'GPU' by modifying the 'device' parameter in the service configuration file respectively.

We use the `streaming_asr_server.py` and `punc_server.py` services to launch the streaming ASR service and the punctuation prediction service respectively.

### 1. Start two servers

-``` bash
+```bash
# Note: streaming speech recognition and punctuation prediction are configured on different graphics cards through configuration files
bash server.sh
```
@@ -423,11 +426,11 @@ bash server.sh
   If `127.0.0.1` is not accessible, you need to use the actual service IP address.

-   ```
+   ```bash
   paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
   ```
   Output:
-   ```
+   ```text
   [2022-05-07 11:21:47,060] [ INFO] - asr websocket client start
   [2022-05-07 11:21:47,060] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
   [2022-05-07 11:21:47,080] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"}
@@ -501,11 +504,11 @@ bash server.sh
   If `127.0.0.1` is not accessible, you need to use the actual service IP address.
- ``` + ```bash python3 websocket_client.py --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --wavfile ./zh.wav ``` Output: - ``` + ```text [2022-05-07 11:11:02,984] [ INFO] - Start to do streaming asr client [2022-05-07 11:11:02,985] [ INFO] - asr websocket client start [2022-05-07 11:11:02,985] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming @@ -574,5 +577,3 @@ bash server.sh [2022-05-07 11:11:18,915] [ INFO] - audio duration: 4.9968125, elapsed time: 15.928460597991943, RTF=3.187724293835709 [2022-05-07 11:11:18,916] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - - diff --git a/demos/streaming_asr_server/README_cn.md b/demos/streaming_asr_server/README_cn.md index f9fd536ea706823e3269d7541acd1f588d4266b8..267367729c53a2e34a8adf60e2f7222382040ba1 100644 --- a/demos/streaming_asr_server/README_cn.md +++ b/demos/streaming_asr_server/README_cn.md @@ -3,16 +3,21 @@ # 流式语音识别服务 ## 介绍 -这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。 +这个 demo 是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。 **流式语音识别服务只支持 `weboscket` 协议,不支持 `http` 协议。** +服务接口定义请参考: +- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API) + ## 使用方法 ### 1. 安装 安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。 -推荐使用 **paddlepaddle 2.2.1** 或以上版本。 -你可以从简单, 中等,困难 几种种方式中选择一种方式安装 PaddleSpeech。 +推荐使用 **paddlepaddle 2.3.1** 或以上版本。 + +你可以从简单,中等,困难 几种方式中选择一种方式安装 PaddleSpeech。 + **如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。** ### 2. 准备配置文件 @@ -26,7 +31,6 @@ * conformer: `conf/ws_conformer_wenetspeech_application.yaml` - 这个 ASR client 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 可以下载此 ASR client的示例音频: @@ -54,28 +58,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `log_file`: log 文件. 默认:`./log/paddlespeech.log` 输出: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 
0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. 
+ [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` - Python API @@ -90,29 +94,29 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav log_file="./log/paddlespeech.log") ``` - 输出: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + 输出: + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 
0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` ### 4. ASR 客户端使用方法 @@ -120,98 +124,97 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav **注意:** 初次使用客户端时响应时间会略长 - 命令行 (推荐使用) - 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 + 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` - paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav - ``` + ```bash + paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav + ``` - 使用帮助: - - ```bash - paddlespeech_client asr_online --help - ``` + 使用帮助: + + ```bash + paddlespeech_client asr_online --help + ``` + + 参数: + - `server_ip`: 服务端ip地址,默认: 127.0.0.1。 + - `port`: 服务端口,默认: 8090。 + - `input`(必须输入): 用于识别的音频文件。 + - `sample_rate`: 音频采样率,默认值:16000。 + - `lang`: 模型语言,默认值:zh_cn。 + - `audio_format`: 音频格式,默认值:wav。 + - `punc.server_ip` 标点预测服务的ip。默认是None。 + - `punc.server_port` 标点预测服务的端口port。默认是None。 - 参数: - - `server_ip`: 服务端ip地址,默认: 127.0.0.1。 - - `port`: 服务端口,默认: 8090。 - - `input`(必须输入): 用于识别的音频文件。 - - `sample_rate`: 音频采样率,默认值:16000。 - - `lang`: 模型语言,默认值:zh_cn。 - - `audio_format`: 音频格式,默认值:wav。 - - `punc.server_ip` 标点预测服务的ip。默认是None。 - - `punc.server_port` 标点预测服务的端口port。默认是None。 - - 输出: - - ```bash - [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client - [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start - [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming - [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav - [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} - [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,762] [ INFO] - client receive 
msg={'result': ''} - [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,532] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 
21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]} - [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 - [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 + 输出: + ```text + [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client + [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start + [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming + [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav + [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} + [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} + 
[2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,532] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,924] [ INFO] - client receive 
msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]} + [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 + [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - Python API @@ -230,7 +233,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` 输出: - ```bash + ```text [2022-05-06 21:14:03,137] [ INFO] - asr websocket client start [2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming [2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} @@ -297,34 +300,29 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav [2022-05-06 21:14:12,159] [ INFO] - audio duration: 4.9968125, elapsed time: 9.019973039627075, RTF=1.8051453881103354 [2022-05-06 21:14:12,160] [ INFO] - asr websocket client finished ``` - - - ## 标点预测 ### 1. 服务端使用方法 - 命令行 **注意:** 默认部署在 `cpu` 设备上,可以通过修改服务配置文件中 `device` 参数部署在 `gpu` 上。 - ``` bash - 在 PaddleSpeech/demos/streaming_asr_server 目录下启动标点预测服务 + ```bash + # 在 PaddleSpeech/demos/streaming_asr_server 目录下启动标点预测服务 paddlespeech_server start --config_file conf/punc_application.yaml ``` - - 使用方法: - + 使用方法: ```bash paddlespeech_server start --help ``` - 参数: + 参数: - `config_file`: 服务的配置文件。 - `log_file`: log 文件。 - 输出: - ``` bash + 输出: + ```text [2022-05-02 17:59:26,285] [ INFO] - Create the TextEngine Instance [2022-05-02 17:59:26,285] [ INFO] - Init the text engine [2022-05-02 17:59:26,285] [ INFO] - Text Engine set the device: gpu:0 @@ -356,26 +354,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav log_file="./log/paddlespeech.log") ``` - 输出 - ``` - [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance - [2022-05-02 18:09:02,543] [ INFO] - Init the text engine - [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 - [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... 
- [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar - W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 - W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. - [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt - [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully - INFO: Started server process [22615] - [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] - INFO: Waiting for application startup. - [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - ``` + 输出: + ```text + [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance + [2022-05-02 18:09:02,543] [ INFO] - Init the text engine + [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 + [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... + [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar + W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 + W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. + [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt + [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully + INFO: Started server process [22615] + [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] + INFO: Waiting for application startup. + [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) + [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) + ``` ### 2. 标点预测客户端使用方法 **注意:** 初次使用客户端时响应时间会略长 @@ -384,17 +382,17 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` - paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康" - ``` - - 输出 + ```bash + paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康" ``` + + 输出: + ```text [2022-05-02 18:12:29,767] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 [2022-05-02 18:12:29,767] [ INFO] - Response time 0.096548 s. 
```

-- Python3 API
+- Python API

  ```python
  from paddlespeech.server.bin.paddlespeech_client import TextClientExecutor
@@ -407,12 +405,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
  print(res)
  ```

-  输出:
-  ``` bash
+  输出:
+  ```text
  我认为跑步最重要的就是给我带来了身体健康。
  ```

-
 ## 联合流式语音识别和标点预测

 **注意:** 默认部署在 `cpu` 设备上,可以通过修改服务配置文件中 `device` 参数将语音识别和标点预测部署在不同的 `gpu` 上。
@@ -420,7 +417,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav

 ### 1. 启动服务

-``` bash
-注意:流式语音识别和标点预测通过配置文件配置到不同的显卡上
+```bash
+# 注意:流式语音识别和标点预测通过配置文件配置到不同的显卡上
 bash server.sh
 ```
@@ -430,11 +427,11 @@ bash server.sh

  若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址

-  ```
+  ```bash
  paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
  ```
-  输出:
-  ```
+  输出:
+  ```text
  [2022-05-07 11:21:47,060] [ INFO] - asr websocket client start
  [2022-05-07 11:21:47,060] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
  [2022-05-07 11:21:47,080] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"}
@@ -508,11 +505,11 @@ bash server.sh

  若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址

-  ```
+  ```bash
  python3 websocket_client.py --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --wavfile ./zh.wav
  ```
-  输出:
-  ```
+  输出:
+  ```text
  [2022-05-07 11:11:02,984] [ INFO] - Start to do streaming asr client
  [2022-05-07 11:11:02,985] [ INFO] - asr websocket client start
  [2022-05-07 11:11:02,985] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
@@ -581,5 +578,3 @@ bash server.sh
  [2022-05-07 11:11:18,915] [ INFO] - audio duration: 4.9968125, elapsed time: 15.928460597991943, RTF=3.187724293835709
  [2022-05-07 11:11:18,916] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康
  ```
-
-
diff --git a/demos/streaming_tts_server/README.md b/demos/streaming_tts_server/README.md
index 645bde86a1df8d9398dbd33312166fdf5a61fa97..15448a46f0b64fbf174d76f6273d27346a53b605 100644
--- a/demos/streaming_tts_server/README.md
+++ b/demos/streaming_tts_server/README.md
@@ -5,15 +5,19 @@
 ## Introduction
 This demo is an implementation of starting the streaming speech synthesis service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in python.
+For service interface definition, please check:
+- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
+- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)

 ## Usage
 ### 1. Installation
 see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

-It is recommended to use **paddlepaddle 2.2.2** or above.
+It is recommended to use **paddlepaddle 2.3.1** or above.
+
-You can choose one way from easy, meduim and hard to install paddlespeech.
+You can choose one way from easy, medium and hard to install paddlespeech.

-**If you install in simple mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
+**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml files in the `conf` directory.**

 ### 2. Prepare config File
 The configuration file can be found in `conf/tts_online_application.yaml`.
@@ -29,11 +33,10 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
 - Both hifigan and mb_melgan support streaming voc inference.
- When the voc model is mb_melgan, when voc_pad=14, the synthetic audio for streaming inference is consistent with the non-streaming synthetic audio; the minimum voc_pad can be set to 7, and the synthetic audio has no abnormal hearing. If the voc_pad is less than 7, the synthetic audio sounds abnormal. - When the voc model is hifigan, when voc_pad=19, the streaming inference synthetic audio is consistent with the non-streaming synthetic audio; when voc_pad=14, the synthetic audio has no abnormal hearing. + - Pad calculation method of streaming vocoder in PaddleSpeech: [AIStudio tutorial](https://aistudio.baidu.com/aistudio/projectdetail/4151335) - Inference speed: mb_melgan > hifigan; Audio quality: mb_melgan < hifigan - **Note:** If the service can be started normally in the container, but the client access IP is unreachable, you can try to replace the `host` address in the configuration file with the local IP address. - - ### 3. Streaming speech synthesis server and client using http protocol #### 3.1 Server Usage - Command Line (Recommended) @@ -53,7 +56,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - `log_file`: log file. Default: ./log/paddlespeech.log Output: - ```bash + ```text [2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s [2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s [2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s @@ -79,8 +82,8 @@ The configuration file can be found in `conf/tts_online_application.yaml`. log_file="./log/paddlespeech.log") ``` - Output: - ```bash + Output: + ```text [2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s [2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s [2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s @@ -93,8 +96,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`. [2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) [2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - ``` #### 3.2 Streaming TTS client Usage @@ -125,7 +126,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - Currently, only the single-speaker model is supported in the code, so `spk_id` does not take effect. Streaming TTS does not support changing sample rate, variable speed and volume. Output: - ```bash + ```text [2022-04-24 21:08:18,559] [ INFO] - tts http client start [2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s @@ -154,7 +155,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`. ``` Output: - ```bash + ```text [2022-04-24 21:11:13,798] [ INFO] - tts http client start [2022-04-24 21:11:16,800] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-24 21:11:16,801] [ INFO] - 首包响应:0.18234872817993164 s @@ -164,7 +165,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`. [2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav ``` - ### 4. 
Streaming speech synthesis server and client using websocket protocol #### 4.1 Server Usage - Command Line (Recommended) @@ -184,21 +184,19 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - `log_file`: log file. Default: ./log/paddlespeech.log Output: - ```bash - [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s - [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s - [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s - [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** - INFO: Started server process [17600] - [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] - INFO: Waiting for application startup. - [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - + ```text + [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s + [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s + [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s + [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** + INFO: Started server process [17600] + [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] + INFO: Waiting for application startup. + [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` - Python API @@ -212,20 +210,19 @@ The configuration file can be found in `conf/tts_online_application.yaml`. ``` Output: - ```bash - [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s - [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s - [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s - [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** - INFO: Started server process [23466] - [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] - INFO: Waiting for application startup. - [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. 
- INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - + ```text + [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s + [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s + [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s + [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** + INFO: Started server process [23466] + [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] + INFO: Waiting for application startup. + [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` #### 4.2 Streaming TTS client Usage @@ -258,7 +255,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`. Output: - ```bash + ```text [2022-04-27 10:21:04,262] [ INFO] - tts websocket client start [2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s @@ -266,7 +263,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`. [2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s [2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812 [2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav - ``` - Python API @@ -283,21 +279,15 @@ The configuration file can be found in `conf/tts_online_application.yaml`. 
spk_id=0,
          output="./output.wav",
          play=False)
-
  ```

  Output:
-  ```bash
-  [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
-  [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
-  [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
-  [2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s
-  [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
-  [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
-  [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
-
+  ```text
+  [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
+  [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
+  [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
+  [2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s
+  [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
+  [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
+  [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
  ```
-
-
-
-
diff --git a/demos/streaming_tts_server/README_cn.md b/demos/streaming_tts_server/README_cn.md
index c88e3196a8eab86598fa274771f67afaf4cd3931..b99155bca18bfc042b2e4604d839e4c2a97431f6 100644
--- a/demos/streaming_tts_server/README_cn.md
+++ b/demos/streaming_tts_server/README_cn.md
@@ -3,15 +3,19 @@

 # 流式语音合成服务

 ## 介绍
-这个demo是一个启动流式语音合成服务和访问该服务的实现。 它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。
-
+这个 demo 是一个启动流式语音合成服务和访问该服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 Python 的几行代码来实现。
+服务接口定义请参考:
+- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
+- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)

 ## 使用方法
 ### 1. 安装
 请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

-推荐使用 **paddlepaddle 2.2.2** 或以上版本。
-你可以从简单,中等,困难几种方式中选择一种方式安装 PaddleSpeech。
+推荐使用 **paddlepaddle 2.3.1** 或以上版本。
+
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+
 **如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**

 ### 2.
准备配置文件 @@ -20,19 +24,20 @@ - `engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。 - 该 demo 主要介绍流式语音合成服务,因此语音任务应设置为 tts。 - 目前引擎类型支持两种形式:**online** 表示使用python进行动态图推理的引擎;**online-onnx** 表示使用 onnxruntime 进行推理的引擎。其中,online-onnx 的推理速度更快。 -- 流式 TTS 引擎的 AM 模型支持:**fastspeech2 以及fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan** +- 流式 TTS 引擎的 AM 模型支持:**fastspeech2 以及 fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan** - 流式 am 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `am_block` 表示 chunk 中的有效帧数,`am_pad` 表示一个 chunk 中 am_block 前后各加的帧数。am_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 - - fastspeech2 不支持流式 am 推理,因此 am_pad 与 m_block 对它无效 + - fastspeech2 不支持流式 am 推理,因此 am_pad 与 am_block 对它无效 - fastspeech2_cnndecoder 支持流式推理,当 am_pad=12 时,流式推理合成音频与非流式合成音频一致 -- 流式 voc 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `voc_block` 表示chunk中的有效帧数,`voc_pad` 表示一个 chunk 中 voc_block 前后各加的帧数。voc_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 +- 流式 voc 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `voc_block` 表示 chunk 中的有效帧数,`voc_pad` 表示一个 chunk 中 voc_block 前后各加的帧数。voc_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 - hifigan, mb_melgan 均支持流式 voc 推理 - 当 voc 模型为 mb_melgan,当 voc_pad=14 时,流式推理合成音频与非流式合成音频一致;voc_pad 最小可以设置为7,合成音频听感上没有异常,若 voc_pad 小于7,合成音频听感上存在异常。 - 当 voc 模型为 hifigan,当 voc_pad=19 时,流式推理合成音频与非流式合成音频一致;当 voc_pad=14 时,合成音频听感上没有异常。 + - PaddleSpeech 中流式声码器 Pad 计算方法: [AIStudio 教程](https://aistudio.baidu.com/aistudio/projectdetail/4151335) - 推理速度:mb_melgan > hifigan; 音频质量:mb_melgan < hifigan - **注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。 -### 3. 使用http协议的流式语音合成服务端及客户端使用方法 +### 3. 使用 http 协议的流式语音合成服务端及客户端使用方法 #### 3.1 服务端使用方法 - 命令行 (推荐使用) @@ -51,7 +56,7 @@ - `log_file`: log 文件. 默认:./log/paddlespeech.log 输出: - ```bash + ```text [2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s [2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s [2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s @@ -64,7 +69,6 @@ [2022-04-24 20:05:28] [INFO] [on.py:59] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) [2022-04-24 20:05:28] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - ``` - Python API @@ -77,8 +81,8 @@ log_file="./log/paddlespeech.log") ``` - 输出: - ```bash + 输出: + ```text [2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s [2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s [2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s @@ -91,8 +95,6 @@ [2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) [2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - ``` #### 3.2 客户端使用方法 @@ -124,7 +126,7 @@ 输出: - ```bash + ```text [2022-04-24 21:08:18,559] [ INFO] - tts http client start [2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s @@ -162,9 +164,8 @@ [2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238 [2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav ``` - -### 4. 使用websocket协议的流式语音合成服务端及客户端使用方法 +### 4. 
使用 websocket 协议的流式语音合成服务端及客户端使用方法 #### 4.1 服务端使用方法 - 命令行 (推荐使用) 首先修改配置文件 `conf/tts_online_application.yaml`, **将 `protocol` 设置为 `websocket`**。 @@ -183,21 +184,19 @@ - `log_file`: log 文件. 默认:./log/paddlespeech.log 输出: - ```bash - [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s - [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s - [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s - [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** - INFO: Started server process [17600] - [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] - INFO: Waiting for application startup. - [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - + ```text + [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s + [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s + [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s + [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** + INFO: Started server process [17600] + [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] + INFO: Waiting for application startup. + [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` - Python API @@ -210,27 +209,26 @@ log_file="./log/paddlespeech.log") ``` - 输出: - ```bash - [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s - [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s - [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s - [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** - INFO: Started server process [23466] - [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] - INFO: Waiting for application startup. - [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. 
-  INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
-  [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
-
+  输出:
+  ```text
+  [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
+  [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
+  [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
+  [2022-04-27 10:20:16,878] [ INFO] - **********************************************************************
+  INFO: Started server process [23466]
+  [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
+  INFO: Waiting for application startup.
+  [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
+  INFO: Application startup complete.
+  [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
+  INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
+  [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
  ```

 #### 4.2 客户端使用方法
 - 命令行 (推荐使用)

-    访问 websocket 流式TTS服务:
+    访问 websocket 流式 TTS 服务:

    若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址

@@ -255,9 +253,8 @@
    - 目前代码中只支持单说话人的模型,因此 spk_id 的选择并不生效。流式 TTS 不支持更换采样率,变速和变音量等功能。
-
    输出:
-    ```bash
+    ```text
    [2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
    [2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
    [2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s
@@ -265,7 +262,6 @@
    [2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s
    [2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
    [2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav
-
    ```

 - Python API
@@ -282,11 +278,10 @@
         spk_id=0,
         output="./output.wav",
         play=False)
-
    ```

    输出:
-    ```bash
+    ```text
    [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
    [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
    [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
@@ -294,8 +289,4 @@
    [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
    [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
    [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
-
    ```
-
-
-
diff --git a/docker/ubuntu18-cpu/Dockerfile b/docker/ubuntu18-cpu/Dockerfile
index d14c01858908fbe79f44f7a2b63cdb9520cabb55..35f45f2e48246f7026c4a3fd9a399ae89e76cce7 100644
--- a/docker/ubuntu18-cpu/Dockerfile
+++ b/docker/ubuntu18-cpu/Dockerfile
@@ -1,15 +1,17 @@
 FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
 LABEL maintainer="paddlesl@baidu.com"

+RUN apt-get update \
+    && apt-get install -y libsndfile-dev \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
 RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
 RUN pip3 uninstall mccabe -y ; exit 0;
 RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4

-RUN cd /home/PaddleSpeech/audio
-RUN python setup.py bdist_wheel
-
-RUN cd /home/PaddleSpeech
+WORKDIR /home/PaddleSpeech/
 RUN python setup.py bdist_wheel
-RUN pip install audio/dist/*.whl dist/*.whl
+RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple

-WORKDIR /home/PaddleSpeech/
+CMD ["bash"]
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 08a049c1be0089cb236cfd1439985ae2665b44fd..bf1486c5e22fb3e08b5dd069810b54755f10505e 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -48,4 +48,5 @@ fastapi
 websockets
 keyboard
 uvicorn
-pattern_singleton
\ No newline at end of file
+pattern_singleton
+braceexpand \ No newline at end of file diff --git a/examples/aishell3/ernie_sat/conf/default.yaml b/examples/aishell3/ernie_sat/conf/default.yaml new file mode 100644 index 0000000000000000000000000000000000000000..fdc767fb0dc19ae3a25ac6dec4f339da17f5c3e0 --- /dev/null +++ b/examples/aishell3/ernie_sat/conf/default.yaml @@ -0,0 +1,283 @@ +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### + +fs: 24000 # sr +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. + +# Only used for feats_type != raw + +fmin: 80 # Minimum frequency of Mel basis. +fmax: 7600 # Maximum frequency of Mel basis. +n_mels: 80 # The number of mel basis. + +mean_phn_span: 8 +mlm_prob: 0.8 + +########################################################### +# DATA SETTING # +########################################################### +batch_size: 20 +num_workers: 2 + +########################################################### +# MODEL SETTING # +########################################################### +model: + text_masking: false + postnet_layers: 5 + postnet_filts: 5 + postnet_chans: 256 + encoder_type: conformer + decoder_type: conformer + enc_input_layer: sega_mlm + enc_pre_speech_layer: 0 + enc_cnn_module_kernel: 7 + enc_attention_dim: 384 + enc_attention_heads: 2 + enc_linear_units: 1536 + enc_num_blocks: 4 + enc_dropout_rate: 0.2 + enc_positional_dropout_rate: 0.2 + enc_attention_dropout_rate: 0.2 + enc_normalize_before: true + enc_macaron_style: true + enc_use_cnn_module: true + enc_selfattention_layer_type: legacy_rel_selfattn + enc_activation_type: swish + enc_pos_enc_layer_type: legacy_rel_pos + enc_positionwise_layer_type: conv1d + enc_positionwise_conv_kernel_size: 3 + dec_cnn_module_kernel: 31 + dec_attention_dim: 384 + dec_attention_heads: 2 + dec_linear_units: 1536 + dec_num_blocks: 4 + dec_dropout_rate: 0.2 + dec_positional_dropout_rate: 0.2 + dec_attention_dropout_rate: 0.2 + dec_macaron_style: true + dec_use_cnn_module: true + dec_selfattention_layer_type: legacy_rel_selfattn + dec_activation_type: swish + dec_pos_enc_layer_type: legacy_rel_pos + dec_positionwise_layer_type: conv1d + dec_positionwise_conv_kernel_size: 3 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +scheduler_params: + d_model: 384 + warmup_steps: 4000 +grad_clip: 1.0 + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 1500 +num_snapshots: 50 + +########################################################### +# OTHER SETTING # +########################################################### +seed: 0 + +token_list: +- +- +- d +- sp +- sh +- ii +- j +- zh +- l +- x +- b +- g +- uu +- e5 +- h +- q +- m +- i1 +- t +- z +- ch +- f +- s +- u4 +- ix4 +- i4 +- n +- i3 +- iu3 +- vv +- ian4 +- ix2 +- r +- e4 +- ai4 +- k +- ing2 +- a1 +- en2 +- ui4 +- ong1 +- uo3 +- u2 +- u3 +- ao4 +- ee +- p +- an1 +- eng2 +- i2 +- in1 +- c +- ai2 +- ian2 +- e2 +- an4 +- ing4 +- v4 +- ai3 +- a5 +- ian3 +- eng1 +- ong4 +- ang4 +- ian1 +- ing1 +- iy4 +- ao3 +- ang1 +- uo4 +- u1 +- iao4 +- iu4 +- a4 +- van2 +- ie4 +- ang2 +- ou4 +- iang4 +- ix1 +- er4 +- iy1 +- e1 +- en1 +- ui2 +- an3 +- ei4 +- ong2 +- uo1 +- 
ou3 +- uo2 +- iao1 +- ou1 +- an2 +- uan4 +- ia4 +- ia1 +- ang3 +- v3 +- iu2 +- iao3 +- in4 +- a3 +- ei3 +- iang3 +- v2 +- eng4 +- en3 +- aa +- uan1 +- v1 +- ao1 +- ve4 +- ie3 +- ai1 +- ing3 +- iang1 +- a2 +- ui1 +- en4 +- en5 +- in3 +- uan3 +- e3 +- ie1 +- ve2 +- ei2 +- in2 +- ix3 +- uan2 +- iang2 +- ie2 +- ua4 +- ou2 +- uai4 +- er2 +- eng3 +- uang3 +- un1 +- ong3 +- uang4 +- vn4 +- un2 +- iy3 +- iz4 +- ui3 +- iao2 +- iong4 +- un4 +- van4 +- ao2 +- uang1 +- iy5 +- o2 +- ei1 +- ua1 +- iu1 +- uang2 +- er5 +- o1 +- un3 +- vn1 +- vn2 +- o4 +- ve1 +- van3 +- ua2 +- er3 +- iong3 +- van1 +- ia2 +- iy2 +- ia3 +- iong1 +- uo5 +- oo +- ve3 +- ou5 +- uai3 +- ian5 +- iong2 +- uai2 +- uai1 +- ua3 +- vn3 +- ia5 +- ie5 +- ueng1 +- o5 +- o3 +- iang5 +- ei5 +- \ No newline at end of file diff --git a/examples/aishell3/ernie_sat/local/preprocess.sh b/examples/aishell3/ernie_sat/local/preprocess.sh new file mode 100755 index 0000000000000000000000000000000000000000..d7a91d08f12b1197dab3dcb5fefa848249cde6d4 --- /dev/null +++ b/examples/aishell3/ernie_sat/local/preprocess.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./aishell3_alignment_tone \ + --output durations.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=aishell3 \ + --rootdir=~/datasets/data_aishell3/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # normalize and covert phone/speaker to id, dev and test should use train's stats + echo "Normalize ..." 
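+    # All three splits below are normalized with the *train* split's
+    # statistics (dump/train/speech_stats.npy), so dev/test features are
+    # scaled exactly like the features the model was trained on.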
+ python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/aishell3/ernie_sat/local/synthesize.sh b/examples/aishell3/ernie_sat/local/synthesize.sh new file mode 100755 index 0000000000000000000000000000000000000000..3e907427c9e402db2326dee82b6f717268e61f61 --- /dev/null +++ b/examples/aishell3/ernie_sat/local/synthesize.sh @@ -0,0 +1,42 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=1 +stop_stage=1 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=hifigan_aishell3 \ + --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/aishell3/ernie_sat/local/train.sh b/examples/aishell3/ernie_sat/local/train.sh new file mode 100755 index 0000000000000000000000000000000000000000..30720e8f5b7ed289be91765e42a6e05bdf279c4a --- /dev/null +++ b/examples/aishell3/ernie_sat/local/train.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/aishell3/ernie_sat/path.sh b/examples/aishell3/ernie_sat/path.sh new file mode 100755 index 0000000000000000000000000000000000000000..4ecab02517d5a4bd59baf95bb5f536e263ce7ac0 --- /dev/null +++ b/examples/aishell3/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 
in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
+
+MODEL=ernie_sat
+export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
diff --git a/examples/aishell3/ernie_sat/run.sh b/examples/aishell3/ernie_sat/run.sh
new file mode 100755
index 0000000000000000000000000000000000000000..d75a19f23481f152d8bb59212732d457f325b4ea
--- /dev/null
+++ b/examples/aishell3/ernie_sat/run.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+gpus=0,1
+stage=0
+stop_stage=100
+
+conf_path=conf/default.yaml
+train_output_path=exp/default
+ckpt_name=snapshot_iter_153.pdz
+
+# with the following command, you can choose the stage range you want to run
+# such as `./run.sh --stage 0 --stop-stage 0`
+# this cannot be mixed with positional arguments `$1`, `$2` ...
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    # prepare data
+    ./local/preprocess.sh ${conf_path} || exit -1
+fi
+
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    # train the model; all checkpoints are saved under `train_output_path/checkpoints/`
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
+fi
+
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    # synthesize; the vocoder is selected by the stage settings in local/synthesize.sh (hifigan by default)
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
+fi
diff --git a/examples/aishell3/tts3/conf/conformer.yaml b/examples/aishell3/tts3/conf/conformer.yaml
index ea73593d77a3117e8b46baab9785bd576a66a093..0834bfe3f30512016fdd0524617f6339720c4055 100644
--- a/examples/aishell3/tts3/conf/conformer.yaml
+++ b/examples/aishell3/tts3/conf/conformer.yaml
@@ -94,8 +94,8 @@ updater:
 #                      OPTIMIZER SETTING                  #
 ###########################################################
 optimizer:
-  optim: adam            # optimizer type
-  learning_rate: 0.001   # learning rate
+  optim: adam              # optimizer type
+  learning_rate: 0.001     # learning rate

 ###########################################################
 #                      TRAINING SETTING                   #
diff --git a/examples/aishell3/tts3/conf/default.yaml b/examples/aishell3/tts3/conf/default.yaml
index ac4956742ebf568a971c24ac93b6ba0c19765211..e65b5d0ec44d01e08d28a038fc68487c7311b471 100644
--- a/examples/aishell3/tts3/conf/default.yaml
+++ b/examples/aishell3/tts3/conf/default.yaml
@@ -88,8 +88,8 @@ updater:
 #                      OPTIMIZER SETTING                  #
 ###########################################################
 optimizer:
-  optim: adam            # optimizer type
-  learning_rate: 0.001   # learning rate
+  optim: adam              # optimizer type
+  learning_rate: 0.001     # learning rate

 ###########################################################
 #                      TRAINING SETTING                   #
diff --git a/examples/aishell3/tts3/local/synthesize.sh b/examples/aishell3/tts3/local/synthesize.sh
index d3978833faae98ba6f89414e9b16caaa18be2489..9134e04264c3f3937d6d0bf560e318db5312a027 100755
--- a/examples/aishell3/tts3/local/synthesize.sh
+++ b/examples/aishell3/tts3/local/synthesize.sh
@@ -37,7 +37,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
        --am_stat=dump/train/speech_stats.npy \
        --voc=hifigan_aishell3 \
        --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
-       --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pd \
+       --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
        --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
diff --git
a/examples/aishell3_vctk/ernie_sat/conf/default.yaml b/examples/aishell3_vctk/ernie_sat/conf/default.yaml new file mode 100644 index 0000000000000000000000000000000000000000..abb69fcc02c20eb9e4ee4da4a825f1dc3d58aeac --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/conf/default.yaml @@ -0,0 +1,352 @@ +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### + +fs: 24000 # sr +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. + +# Only used for feats_type != raw + +fmin: 80 # Minimum frequency of Mel basis. +fmax: 7600 # Maximum frequency of Mel basis. +n_mels: 80 # The number of mel basis. + +mean_phn_span: 8 +mlm_prob: 0.8 + +########################################################### +# DATA SETTING # +########################################################### +batch_size: 20 +num_workers: 2 + +########################################################### +# MODEL SETTING # +########################################################### +model: + text_masking: true + postnet_layers: 5 + postnet_filts: 5 + postnet_chans: 256 + encoder_type: conformer + decoder_type: conformer + enc_input_layer: sega_mlm + enc_pre_speech_layer: 0 + enc_cnn_module_kernel: 7 + enc_attention_dim: 384 + enc_attention_heads: 2 + enc_linear_units: 1536 + enc_num_blocks: 4 + enc_dropout_rate: 0.2 + enc_positional_dropout_rate: 0.2 + enc_attention_dropout_rate: 0.2 + enc_normalize_before: true + enc_macaron_style: true + enc_use_cnn_module: true + enc_selfattention_layer_type: legacy_rel_selfattn + enc_activation_type: swish + enc_pos_enc_layer_type: legacy_rel_pos + enc_positionwise_layer_type: conv1d + enc_positionwise_conv_kernel_size: 3 + dec_cnn_module_kernel: 31 + dec_attention_dim: 384 + dec_attention_heads: 2 + dec_linear_units: 1536 + dec_num_blocks: 4 + dec_dropout_rate: 0.2 + dec_positional_dropout_rate: 0.2 + dec_attention_dropout_rate: 0.2 + dec_macaron_style: true + dec_use_cnn_module: true + dec_selfattention_layer_type: legacy_rel_selfattn + dec_activation_type: swish + dec_pos_enc_layer_type: legacy_rel_pos + dec_positionwise_layer_type: conv1d + dec_positionwise_conv_kernel_size: 3 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +scheduler_params: + d_model: 384 + warmup_steps: 4000 +grad_clip: 1.0 + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 700 +num_snapshots: 50 + +########################################################### +# OTHER SETTING # +########################################################### +seed: 0 + +token_list: +- +- +- AH0 +- T +- N +- sp +- S +- R +- D +- L +- Z +- DH +- IH1 +- K +- W +- M +- EH1 +- AE1 +- ER0 +- B +- IY1 +- P +- V +- IY0 +- F +- HH +- AA1 +- AY1 +- AH1 +- EY1 +- IH0 +- AO1 +- OW1 +- UW1 +- G +- NG +- SH +- Y +- TH +- ER1 +- JH +- UH1 +- AW1 +- CH +- IH2 +- OW0 +- OW2 +- EY2 +- EH2 +- UW0 +- OY1 +- ZH +- EH0 +- AY2 +- AW2 +- AA2 +- AE2 +- IY2 +- AH2 +- AE0 +- AO2 +- AY0 +- AO0 +- UW2 +- UH2 +- AA0 +- EY0 +- AW0 +- UH0 +- ER2 +- OY2 +- OY0 +- d +- sh +- ii +- j +- zh +- l +- x +- b +- g +- uu +- e5 +- h +- q +- m +- i1 +- t +- z +- ch +- f +- s +- u4 +- ix4 +- i4 +- n +- i3 +- iu3 +- vv +- ian4 +- 
ix2 +- r +- e4 +- ai4 +- k +- ing2 +- a1 +- en2 +- ui4 +- ong1 +- uo3 +- u2 +- u3 +- ao4 +- ee +- p +- an1 +- eng2 +- i2 +- in1 +- c +- ai2 +- ian2 +- e2 +- an4 +- ing4 +- v4 +- ai3 +- a5 +- ian3 +- eng1 +- ong4 +- ang4 +- ian1 +- ing1 +- iy4 +- ao3 +- ang1 +- uo4 +- u1 +- iao4 +- iu4 +- a4 +- van2 +- ie4 +- ang2 +- ou4 +- iang4 +- ix1 +- er4 +- iy1 +- e1 +- en1 +- ui2 +- an3 +- ei4 +- ong2 +- uo1 +- ou3 +- uo2 +- iao1 +- ou1 +- an2 +- uan4 +- ia4 +- ia1 +- ang3 +- v3 +- iu2 +- iao3 +- in4 +- a3 +- ei3 +- iang3 +- v2 +- eng4 +- en3 +- aa +- uan1 +- v1 +- ao1 +- ve4 +- ie3 +- ai1 +- ing3 +- iang1 +- a2 +- ui1 +- en4 +- en5 +- in3 +- uan3 +- e3 +- ie1 +- ve2 +- ei2 +- in2 +- ix3 +- uan2 +- iang2 +- ie2 +- ua4 +- ou2 +- uai4 +- er2 +- eng3 +- uang3 +- un1 +- ong3 +- uang4 +- vn4 +- un2 +- iy3 +- iz4 +- ui3 +- iao2 +- iong4 +- un4 +- van4 +- ao2 +- uang1 +- iy5 +- o2 +- ei1 +- ua1 +- iu1 +- uang2 +- er5 +- o1 +- un3 +- vn1 +- vn2 +- o4 +- ve1 +- van3 +- ua2 +- er3 +- iong3 +- van1 +- ia2 +- iy2 +- ia3 +- iong1 +- uo5 +- oo +- ve3 +- ou5 +- uai3 +- ian5 +- iong2 +- uai2 +- uai1 +- ua3 +- vn3 +- ia5 +- ie5 +- ueng1 +- o5 +- o3 +- iang5 +- ei5 +- diff --git a/examples/aishell3_vctk/ernie_sat/local/preprocess.sh b/examples/aishell3_vctk/ernie_sat/local/preprocess.sh new file mode 100755 index 0000000000000000000000000000000000000000..783fd6333340ec4028a1fe7063c0335231ce0f98 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/local/preprocess.sh @@ -0,0 +1,89 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results for aishell3 ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./aishell3_alignment_tone \ + --output durations_aishell3.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results for vctk ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./vctk_alignment \ + --output durations_vctk.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get durations from MFA's result + echo "concat durations_aishell3.txt and durations_vctk.txt to durations.txt" + cat durations_aishell3.txt durations_vctk.txt > durations.txt +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=aishell3 \ + --rootdir=~/datasets/data_aishell3/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=vctk \ + --rootdir=~/datasets/VCTK-Corpus-0.92/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" +fi + +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + # normalize and covert phone/speaker to id, dev and test should use train's stats + echo "Normalize ..." 
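+    # As in the aishell3-only recipe, dev/test reuse the train split's
+    # mean/std; both the aishell3 and vctk features were dumped into the
+    # same `dump` directory above, so one set of train stats covers every split.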
+ python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/aishell3_vctk/ernie_sat/local/synthesize.sh b/examples/aishell3_vctk/ernie_sat/local/synthesize.sh new file mode 100755 index 0000000000000000000000000000000000000000..3e907427c9e402db2326dee82b6f717268e61f61 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/local/synthesize.sh @@ -0,0 +1,42 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=1 +stop_stage=1 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=hifigan_aishell3 \ + --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/aishell3_vctk/ernie_sat/local/train.sh b/examples/aishell3_vctk/ernie_sat/local/train.sh new file mode 100755 index 0000000000000000000000000000000000000000..30720e8f5b7ed289be91765e42a6e05bdf279c4a --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/local/train.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/aishell3_vctk/ernie_sat/path.sh b/examples/aishell3_vctk/ernie_sat/path.sh new file mode 100755 index 0000000000000000000000000000000000000000..4ecab02517d5a4bd59baf95bb5f536e263ce7ac0 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + 
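+# Keep the source tree clean: do not write .pyc files while running experiments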
+export PYTHONDONTWRITEBYTECODE=1
+# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
+
+MODEL=ernie_sat
+export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
\ No newline at end of file
diff --git a/examples/aishell3_vctk/ernie_sat/run.sh b/examples/aishell3_vctk/ernie_sat/run.sh
new file mode 100755
index 0000000000000000000000000000000000000000..d75a19f23481f152d8bb59212732d457f325b4ea
--- /dev/null
+++ b/examples/aishell3_vctk/ernie_sat/run.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+gpus=0,1
+stage=0
+stop_stage=100
+
+conf_path=conf/default.yaml
+train_output_path=exp/default
+ckpt_name=snapshot_iter_153.pdz
+
+# with the following command, you can choose the stage range you want to run
+# such as `./run.sh --stage 0 --stop-stage 0`
+# this cannot be mixed with positional arguments `$1`, `$2` ...
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    # prepare data
+    ./local/preprocess.sh ${conf_path} || exit -1
+fi
+
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    # train the model; all checkpoints are saved under `train_output_path/checkpoints/`
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
+fi
+
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    # synthesize; the vocoder is selected by the stage settings in local/synthesize.sh (hifigan by default)
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
+fi
diff --git a/examples/csmsc/tts2/conf/default.yaml b/examples/csmsc/tts2/conf/default.yaml
index a3366b8f92660386009847d748ba9072b99bb4bc..a5a258b7edaee1a88adfa37c9968ba84e8b775ff 100644
--- a/examples/csmsc/tts2/conf/default.yaml
+++ b/examples/csmsc/tts2/conf/default.yaml
@@ -21,22 +21,22 @@ num_workers: 4
 #                       MODEL SETTING                     #
 ###########################################################
 model:
-  encoder_hidden_size: 128
-  encoder_kernel_size: 3
-  encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
-  duration_predictor_hidden_size: 128
-  decoder_hidden_size: 128
-  decoder_output_size: 80
-  decoder_kernel_size: 3
-  decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
+  encoder_hidden_size: 128
+  encoder_kernel_size: 3
+  encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
+  duration_predictor_hidden_size: 128
+  decoder_hidden_size: 128
+  decoder_output_size: 80
+  decoder_kernel_size: 3
+  decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1]

 ###########################################################
 #                      OPTIMIZER SETTING                  #
 ###########################################################
 optimizer:
-  optim: adam            # optimizer type
-  learning_rate: 0.002   # learning rate
-  max_grad_norm: 1
+  optim: adam              # optimizer type
+  learning_rate: 0.002     # learning rate
+  max_grad_norm: 1

 ###########################################################
 #                      TRAINING SETTING                   #
diff --git a/examples/csmsc/voc3/conf/default.yaml b/examples/csmsc/voc3/conf/default.yaml
index fbff54f193f074402cde9d571aec51f880fc3deb..a5ee17808516296c7e864bc6d9293e7da40be372 100644
--- a/examples/csmsc/voc3/conf/default.yaml
+++ b/examples/csmsc/voc3/conf/default.yaml
@@ -29,7 +29,7 @@ generator_params:
     out_channels: 4          # Number of output channels.
     kernel_size: 7           # Kernel size of initial and final conv layers.
     channels: 384            # Initial number of channels for conv layers.
-    upsample_scales: [5, 5, 3]        # List of Upsampling scales.
prod(upsample_scales) == n_shift + upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack. stacks: 4 # Number of stacks in a single residual stack module. use_weight_norm: True # Whether to use weight normalization. diff --git a/examples/csmsc/voc3/conf/finetune.yaml b/examples/csmsc/voc3/conf/finetune.yaml index 0a38c28200e16a7a73c76eb24562b1d6e30c8454..8c37ac30272dfbdf56ba75d7167d01c02a8f4542 100644 --- a/examples/csmsc/voc3/conf/finetune.yaml +++ b/examples/csmsc/voc3/conf/finetune.yaml @@ -29,7 +29,7 @@ generator_params: out_channels: 4 # Number of output channels. kernel_size: 7 # Kernel size of initial and final conv layers. channels: 384 # Initial number of channels for conv layers. - upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) == n_shift + upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack. stacks: 4 # Number of stacks in a single residual stack module. use_weight_norm: True # Whether to use weight normalization. diff --git a/examples/ernie_sat/local/align.py b/examples/ernie_sat/local/align.py index 025877ddf35dd72dd5a9e80bfe3714998374eff0..ff47cac5bcfeb6ea2bfbb67dcff794b7d75058a2 100755 --- a/examples/ernie_sat/local/align.py +++ b/examples/ernie_sat/local/align.py @@ -1,3 +1,16 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. """ Usage: align.py wavfile trsfile outwordfile outphonefile """ diff --git a/examples/ernie_sat/local/inference.py b/examples/ernie_sat/local/inference.py index 196d9c6d058e07362c817735fd04acf24c221ba3..e6a0788fdb91b6452402aca6fa6c3e915e4caf46 100644 --- a/examples/ernie_sat/local/inference.py +++ b/examples/ernie_sat/local/inference.py @@ -1,4 +1,16 @@ -#!/usr/bin/env python3 +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
import os import random from typing import Dict @@ -305,7 +317,6 @@ def get_dur_adj_factor(orig_dur: List[int], def prep_feats_with_dur(wav_path: str, - mlm_model: nn.Layer, source_lang: str="English", target_lang: str="English", old_str: str="", @@ -425,8 +436,7 @@ def prep_feats_with_dur(wav_path: str, return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy -def prep_feats(mlm_model: nn.Layer, - wav_path: str, +def prep_feats(wav_path: str, source_lang: str="english", target_lang: str="english", old_str: str="", @@ -440,7 +450,6 @@ def prep_feats(mlm_model: nn.Layer, wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur( source_lang=source_lang, target_lang=target_lang, - mlm_model=mlm_model, old_str=old_str, new_str=new_str, wav_path=wav_path, @@ -482,7 +491,6 @@ def decode_with_model(mlm_model: nn.Layer, batch, old_span_bdy, new_span_bdy = prep_feats( source_lang=source_lang, target_lang=target_lang, - mlm_model=mlm_model, wav_path=wav_path, old_str=old_str, new_str=new_str, diff --git a/examples/ernie_sat/local/inference_new.py b/examples/ernie_sat/local/inference_new.py new file mode 100644 index 0000000000000000000000000000000000000000..525967eb1b28e2052cf83bdc9fc80b57e6d18c9b --- /dev/null +++ b/examples/ernie_sat/local/inference_new.py @@ -0,0 +1,622 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
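+"""ERNIE-SAT inference (new ErnieSAT model class).
+
+Driven by --task_name, this script covers the three tasks wired up in the
+run_*_new.sh scripts: 'edit' (speech editing), 'synthesize' (personalized
+speech synthesis), and cross-lingual voice cloning (en -> zh).
+"""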
+import os +import random +from typing import Dict +from typing import List + +import librosa +import numpy as np +import paddle +import soundfile as sf +import yaml +from align import alignment +from align import alignment_zh +from align import words2phns +from align import words2phns_zh +from paddle import nn +from sedit_arg_parser import parse_args +from utils import eval_durs +from utils import get_voc_out +from utils import is_chinese +from utils import load_num_sequence_text +from utils import read_2col_text +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.am_batch_fn import build_mlm_collate_fn +from paddlespeech.t2s.models.ernie_sat.ernie_sat import ErnieSAT + +random.seed(0) +np.random.seed(0) + + +def get_wav(wav_path: str, + source_lang: str='english', + target_lang: str='english', + model_name: str="paddle_checkpoint_en", + old_str: str="", + new_str: str="", + non_autoreg: bool=True): + wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length = get_mlm_output( + source_lang=source_lang, + target_lang=target_lang, + model_name=model_name, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + use_teacher_forcing=non_autoreg) + + masked_feat = output_feat[new_span_bdy[0]:new_span_bdy[1]] + + alt_wav = get_voc_out(masked_feat) + + old_time_bdy = [hop_length * x for x in old_span_bdy] + + wav_replaced = np.concatenate( + [wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]]) + + data_dict = {"origin": wav_org, "output": wav_replaced} + + return data_dict + + +def load_model(model_name: str="paddle_checkpoint_en"): + config_path = './pretrained_model/{}/default.yaml'.format(model_name) + model_path = './pretrained_model/{}/model.pdparams'.format(model_name) + with open(config_path) as f: + conf = CfgNode(yaml.safe_load(f)) + token_list = list(conf.token_list) + vocab_size = len(token_list) + odim = conf.n_mels + mlm_model = ErnieSAT(idim=vocab_size, odim=odim, **conf["model"]) + state_dict = paddle.load(model_path) + new_state_dict = {} + for key, value in state_dict.items(): + new_key = "model." 
+ key + new_state_dict[new_key] = value + mlm_model.set_state_dict(new_state_dict) + mlm_model.eval() + + return mlm_model, conf + + +def read_data(uid: str, prefix: os.PathLike): + # 获取 uid 对应的文本 + mfa_text = read_2col_text(prefix + '/text')[uid] + # 获取 uid 对应的音频路径 + mfa_wav_path = read_2col_text(prefix + '/wav.scp')[uid] + if not os.path.isabs(mfa_wav_path): + mfa_wav_path = prefix + mfa_wav_path + return mfa_text, mfa_wav_path + + +def get_align_data(uid: str, prefix: os.PathLike): + mfa_path = prefix + "mfa_" + mfa_text = read_2col_text(mfa_path + 'text')[uid] + mfa_start = load_num_sequence_text( + mfa_path + 'start', loader_type='text_float')[uid] + mfa_end = load_num_sequence_text( + mfa_path + 'end', loader_type='text_float')[uid] + mfa_wav_path = read_2col_text(mfa_path + 'wav.scp')[uid] + return mfa_text, mfa_start, mfa_end, mfa_wav_path + + +# 获取需要被 mask 的 mel 帧的范围 +def get_masked_mel_bdy(mfa_start: List[float], + mfa_end: List[float], + fs: int, + hop_length: int, + span_to_repl: List[List[int]]): + align_start = np.array(mfa_start) + align_end = np.array(mfa_end) + align_start = np.floor(fs * align_start / hop_length).astype('int') + align_end = np.floor(fs * align_end / hop_length).astype('int') + if span_to_repl[0] >= len(mfa_start): + span_bdy = [align_end[-1], align_end[-1]] + else: + span_bdy = [ + align_start[span_to_repl[0]], align_end[span_to_repl[1] - 1] + ] + return span_bdy, align_start, align_end + + +def recover_dict(word2phns: Dict[str, str], tp_word2phns: Dict[str, str]): + dic = {} + keys_to_del = [] + exist_idx = [] + sp_count = 0 + add_sp_count = 0 + for key in word2phns.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + exist_idx.append(int(idx)) + else: + keys_to_del.append(key) + + for key in keys_to_del: + del word2phns[key] + + cur_id = 0 + for key in tp_word2phns.keys(): + if cur_id in exist_idx: + dic[str(cur_id) + "_sp"] = 'sp' + cur_id += 1 + add_sp_count += 1 + idx, wrd = key.split('_') + dic[str(cur_id) + "_" + wrd] = tp_word2phns[key] + cur_id += 1 + + if add_sp_count + 1 == sp_count: + dic[str(cur_id) + "_sp"] = 'sp' + add_sp_count += 1 + + assert add_sp_count == sp_count, "sp are not added in dic" + return dic + + +def get_max_idx(dic): + return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1] + + +def get_phns_and_spans(wav_path: str, + old_str: str="", + new_str: str="", + source_lang: str="english", + target_lang: str="english"): + is_append = (old_str == new_str[:len(old_str)]) + old_phns, mfa_start, mfa_end = [], [], [] + # source + if source_lang == "english": + intervals, word2phns = alignment(wav_path, old_str) + elif source_lang == "chinese": + intervals, word2phns = alignment_zh(wav_path, old_str) + _, tp_word2phns = words2phns_zh(old_str) + + for key, value in tp_word2phns.items(): + idx, wrd = key.split('_') + cur_val = " ".join(value) + tp_word2phns[key] = cur_val + + word2phns = recover_dict(word2phns, tp_word2phns) + else: + assert source_lang == "chinese" or source_lang == "english", \ + "source_lang is wrong..." 
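+    # each MFA interval is (phone, start_time, end_time); split them into
+    # the parallel phone / start / end lists below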
+ + for item in intervals: + old_phns.append(item[0]) + mfa_start.append(float(item[1])) + mfa_end.append(float(item[2])) + # target + if is_append and (source_lang != target_lang): + cross_lingual_clone = True + else: + cross_lingual_clone = False + + if cross_lingual_clone: + str_origin = new_str[:len(old_str)] + str_append = new_str[len(old_str):] + + if target_lang == "chinese": + phns_origin, origin_word2phns = words2phns(str_origin) + phns_append, append_word2phns_tmp = words2phns_zh(str_append) + + elif target_lang == "english": + # 原始句子 + phns_origin, origin_word2phns = words2phns_zh(str_origin) + # clone 句子 + phns_append, append_word2phns_tmp = words2phns(str_append) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "cloning is not support for this language, please check it." + + new_phns = phns_origin + phns_append + + append_word2phns = {} + length = len(origin_word2phns) + for key, value in append_word2phns_tmp.items(): + idx, wrd = key.split('_') + append_word2phns[str(int(idx) + length) + '_' + wrd] = value + new_word2phns = origin_word2phns.copy() + new_word2phns.update(append_word2phns) + + else: + if source_lang == target_lang and target_lang == "english": + new_phns, new_word2phns = words2phns(new_str) + elif source_lang == target_lang and target_lang == "chinese": + new_phns, new_word2phns = words2phns_zh(new_str) + else: + assert source_lang == target_lang, \ + "source language is not same with target language..." + + span_to_repl = [0, len(old_phns) - 1] + span_to_add = [0, len(new_phns) - 1] + left_idx = 0 + new_phns_left = [] + sp_count = 0 + # find the left different index + for key in word2phns.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_left.append('sp') + else: + idx = str(int(idx) - sp_count) + if idx + '_' + wrd in new_word2phns: + left_idx += len(new_word2phns[idx + '_' + wrd]) + new_phns_left.extend(word2phns[key].split()) + else: + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + break + + # reverse word2phns and new_word2phns + right_idx = 0 + new_phns_right = [] + sp_count = 0 + word2phns_max_idx = get_max_idx(word2phns) + new_word2phns_max_idx = get_max_idx(new_word2phns) + new_phns_mid = [] + if is_append: + new_phns_right = [] + new_phns_mid = new_phns[left_idx:] + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + span_to_repl[1] = len(old_phns) - len(new_phns_right) + # speech edit + else: + for key in list(word2phns.keys())[::-1]: + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_right = ['sp'] + new_phns_right + else: + idx = str(new_word2phns_max_idx - (word2phns_max_idx - int(idx) + - sp_count)) + if idx + '_' + wrd in new_word2phns: + right_idx -= len(new_word2phns[idx + '_' + wrd]) + new_phns_right = word2phns[key].split() + new_phns_right + else: + span_to_repl[1] = len(old_phns) - len(new_phns_right) + new_phns_mid = new_phns[left_idx:right_idx] + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + if len(new_phns_mid) == 0: + span_to_add[1] = min(span_to_add[1] + 1, len(new_phns)) + span_to_add[0] = max(0, span_to_add[0] - 1) + span_to_repl[0] = max(0, span_to_repl[0] - 1) + span_to_repl[1] = min(span_to_repl[1] + 1, + len(old_phns)) + break + new_phns = new_phns_left + new_phns_mid + new_phns_right + ''' + For that reason cover should not be given. + For that reason cover is impossible to be given. 
+ span_to_repl: [17, 23] "should not" + span_to_add: [17, 30] "is impossible to" + ''' + return mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add + + +# mfa 获得的 duration 和 fs2 的 duration_predictor 获取的 duration 可能不同 +# 此处获得一个缩放比例, 用于预测值和真实值之间的缩放 +def get_dur_adj_factor(orig_dur: List[int], + pred_dur: List[int], + phns: List[str]): + length = 0 + factor_list = [] + for orig, pred, phn in zip(orig_dur, pred_dur, phns): + if pred == 0 or phn == 'sp': + continue + else: + factor_list.append(orig / pred) + factor_list = np.array(factor_list) + factor_list.sort() + if len(factor_list) < 5: + return 1 + length = 2 + avg = np.average(factor_list[length:-length]) + return avg + + +def prep_feats_with_dur(wav_path: str, + source_lang: str="English", + target_lang: str="English", + old_str: str="", + new_str: str="", + mask_reconstruct: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False, + fs: int=24000, + hop_length: int=300): + ''' + Returns: + np.ndarray: new wav, replace the part to be edited in original wav with 0 + List[str]: new phones + List[float]: mfa start of new wav + List[float]: mfa end of new wav + List[int]: masked mel boundary of original wav + List[int]: masked mel boundary of new wav + ''' + wav_org, _ = librosa.load(wav_path, sr=fs) + + mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add = get_phns_and_spans( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + source_lang=source_lang, + target_lang=target_lang) + + if start_end_sp: + if new_phns[-1] != 'sp': + new_phns = new_phns + ['sp'] + # 中文的 phns 不一定都在 fastspeech2 的字典里, 用 sp 代替 + if target_lang == "english" or target_lang == "chinese": + old_durs = eval_durs(old_phns, target_lang=source_lang) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "calculate duration_predict is not support for this language..." + + orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)] + if '[MASK]' in new_str: + new_phns = old_phns + span_to_add = span_to_repl + d_factor_left = get_dur_adj_factor( + orig_dur=orig_old_durs[:span_to_repl[0]], + pred_dur=old_durs[:span_to_repl[0]], + phns=old_phns[:span_to_repl[0]]) + d_factor_right = get_dur_adj_factor( + orig_dur=orig_old_durs[span_to_repl[1]:], + pred_dur=old_durs[span_to_repl[1]:], + phns=old_phns[span_to_repl[1]:]) + d_factor = (d_factor_left + d_factor_right) / 2 + new_durs_adjusted = [d_factor * i for i in old_durs] + else: + if duration_adjust: + d_factor = get_dur_adj_factor( + orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns) + d_factor = d_factor * 1.25 + else: + d_factor = 1 + + if target_lang == "english" or target_lang == "chinese": + new_durs = eval_durs(new_phns, target_lang=target_lang) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "calculate duration_predict is not support for this language..." 
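+    # scale every predicted duration by the tempo factor estimated above, so
+    # the synthesized span keeps the speaking rate of the original utterance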
+
+    new_durs_adjusted = [d_factor * i for i in new_durs]
+
+    new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]])
+    old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]])
+    dur_offset = new_span_dur_sum - old_span_dur_sum
+    new_mfa_start = mfa_start[:span_to_repl[0]]
+    new_mfa_end = mfa_end[:span_to_repl[0]]
+    for i in new_durs_adjusted[span_to_add[0]:span_to_add[1]]:
+        if len(new_mfa_end) == 0:
+            new_mfa_start.append(0)
+            new_mfa_end.append(i)
+        else:
+            new_mfa_start.append(new_mfa_end[-1])
+            new_mfa_end.append(new_mfa_end[-1] + i)
+    new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]]
+    new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]]
+
+    # 3. get new wav
+    # appending at the end of the original sentence
+    if span_to_repl[0] >= len(mfa_start):
+        left_idx = len(wav_org)
+        right_idx = left_idx
+    # replacing in the middle of the original sentence
+    else:
+        left_idx = int(np.floor(mfa_start[span_to_repl[0]] * fs))
+        right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * fs))
+    blank_wav = np.zeros(
+        (int(np.ceil(new_span_dur_sum * fs)), ), dtype=wav_org.dtype)
+    # in the original audio, the span to be edited is replaced with silence;
+    # the length of the silence is decided by fs2's duration_predictor
+    new_wav = np.concatenate(
+        [wav_org[:left_idx], blank_wav, wav_org[right_idx:]])
+
+    # 4. get old and new mel spans to be masked
+    # [92, 92]
+
+    old_span_bdy, mfa_start, mfa_end = get_masked_mel_bdy(
+        mfa_start=mfa_start,
+        mfa_end=mfa_end,
+        fs=fs,
+        hop_length=hop_length,
+        span_to_repl=span_to_repl)
+    # [92, 174]
+    # new_mfa_start / new_mfa_end: start & end in seconds -> frame level
+    new_span_bdy, new_mfa_start, new_mfa_end = get_masked_mel_bdy(
+        mfa_start=new_mfa_start,
+        mfa_end=new_mfa_end,
+        fs=fs,
+        hop_length=hop_length,
+        span_to_repl=span_to_add)
+
+    # old_span_bdy and new_span_bdy are frame-level ranges
+    return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy
+
+
+def prep_feats(wav_path: str,
+               source_lang: str="english",
+               target_lang: str="english",
+               old_str: str="",
+               new_str: str="",
+               duration_adjust: bool=True,
+               start_end_sp: bool=False,
+               mask_reconstruct: bool=False,
+               fs: int=24000,
+               hop_length: int=300,
+               token_list: List[str]=[]):
+    wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur(
+        source_lang=source_lang,
+        target_lang=target_lang,
+        old_str=old_str,
+        new_str=new_str,
+        wav_path=wav_path,
+        duration_adjust=duration_adjust,
+        start_end_sp=start_end_sp,
+        mask_reconstruct=mask_reconstruct,
+        fs=fs,
+        hop_length=hop_length)
+
+    token_to_id = {item: i for i, item in enumerate(token_list)}
+    text = np.array(
+        list(map(lambda x: token_to_id.get(x, token_to_id['<unk>']), phns)))
+    span_bdy = np.array(new_span_bdy)
+
+    batch = [('1', {
+        "speech": wav,
+        "align_start": mfa_start,
+        "align_end": mfa_end,
+        "text": text,
+        "span_bdy": span_bdy
+    })]
+
+    return batch, old_span_bdy, new_span_bdy
+
+
+def decode_with_model(mlm_model: nn.Layer,
+                      collate_fn,
+                      wav_path: str,
+                      source_lang: str="english",
+                      target_lang: str="english",
+                      old_str: str="",
+                      new_str: str="",
+                      use_teacher_forcing: bool=False,
+                      duration_adjust: bool=True,
+                      start_end_sp: bool=False,
+                      fs: int=24000,
+                      hop_length: int=300,
+                      token_list: List[str]=[]):
+    batch, old_span_bdy, new_span_bdy = prep_feats(
+        source_lang=source_lang,
+        target_lang=target_lang,
+        wav_path=wav_path,
+        old_str=old_str,
+        new_str=new_str,
+        duration_adjust=duration_adjust,
+        start_end_sp=start_end_sp,
+        fs=fs,
+        hop_length=hop_length,
+        token_list=token_list)
+
+    feats = collate_fn(batch)[1]
+
+    if 'text_masked_pos' in feats.keys():
+        feats.pop('text_masked_pos')
+
+    output = 
mlm_model.inference( + text=feats['text'], + speech=feats['speech'], + masked_pos=feats['masked_pos'], + speech_mask=feats['speech_mask'], + text_mask=feats['text_mask'], + speech_seg_pos=feats['speech_seg_pos'], + text_seg_pos=feats['text_seg_pos'], + span_bdy=new_span_bdy, + use_teacher_forcing=use_teacher_forcing) + + # 拼接音频 + output_feat = paddle.concat(x=output, axis=0) + wav_org, _ = librosa.load(wav_path, sr=fs) + return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length + + +def get_mlm_output(wav_path: str, + model_name: str="paddle_checkpoint_en", + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False): + mlm_model, train_conf = load_model(model_name) + + collate_fn = build_mlm_collate_fn( + sr=train_conf.fs, + n_fft=train_conf.n_fft, + hop_length=train_conf.n_shift, + win_length=train_conf.win_length, + n_mels=train_conf.n_mels, + fmin=train_conf.fmin, + fmax=train_conf.fmax, + mlm_prob=train_conf.mlm_prob, + mean_phn_span=train_conf.mean_phn_span, + seg_emb=train_conf.model['enc_input_layer'] == 'sega_mlm') + + return decode_with_model( + source_lang=source_lang, + target_lang=target_lang, + mlm_model=mlm_model, + collate_fn=collate_fn, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + use_teacher_forcing=use_teacher_forcing, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + fs=train_conf.fs, + hop_length=train_conf.n_shift, + token_list=train_conf.token_list) + + +def evaluate(uid: str, + source_lang: str="english", + target_lang: str="english", + prefix: os.PathLike="./prompt/dev/", + model_name: str="paddle_checkpoint_en", + new_str: str="", + prompt_decoding: bool=False, + task_name: str=None): + + # get origin text and path of origin wav + old_str, wav_path = read_data(uid=uid, prefix=prefix) + + if task_name == 'edit': + new_str = new_str + elif task_name == 'synthesize': + new_str = old_str + new_str + else: + new_str = old_str + ' '.join([ch for ch in new_str if is_chinese(ch)]) + + print('new_str is ', new_str) + + results_dict = get_wav( + source_lang=source_lang, + target_lang=target_lang, + model_name=model_name, + wav_path=wav_path, + old_str=old_str, + new_str=new_str) + return results_dict + + +if __name__ == "__main__": + # parse config and args + args = parse_args() + + data_dict = evaluate( + uid=args.uid, + source_lang=args.source_lang, + target_lang=args.target_lang, + prefix=args.prefix, + model_name=args.model_name, + new_str=args.new_str, + task_name=args.task_name) + sf.write(args.output_name, data_dict['output'], samplerate=24000) + print("finished...") diff --git a/examples/ernie_sat/local/sedit_arg_parser.py b/examples/ernie_sat/local/sedit_arg_parser.py index 21c6d0b4b938001532a3640675f298b2bb5fd68b..ad7e5719115b10ce83b36e0c78a3b84f21e0b3ae 100644 --- a/examples/ernie_sat/local/sedit_arg_parser.py +++ b/examples/ernie_sat/local/sedit_arg_parser.py @@ -1,3 +1,16 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import argparse
diff --git a/examples/ernie_sat/local/utils.py b/examples/ernie_sat/local/utils.py
index 836942a2608050db74e4dca7535b45b2a2698933..f2dce504af6e34ecddf641a738caee8b0b705613 100644
--- a/examples/ernie_sat/local/utils.py
+++ b/examples/ernie_sat/local/utils.py
@@ -1,3 +1,16 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 from pathlib import Path
 from typing import Dict
 from typing import List
diff --git a/examples/ernie_sat/run_clone_en_to_zh_new.sh b/examples/ernie_sat/run_clone_en_to_zh_new.sh
new file mode 100755
index 0000000000000000000000000000000000000000..12fdf23f1eb1fbc2934fe49a6acae8363cf826d3
--- /dev/null
+++ b/examples/ernie_sat/run_clone_en_to_zh_new.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+# en --> zh speech synthesis (cross-lingual voice cloning)
+# use Prompt_003_new ("This was not the show for me.") as the prompt speech to synthesize '今天天气很好'
+# NOTE: new_str must be Chinese characters; preprocessing keeps only the Chinese
+# characters, so what gets synthesized is the preprocessed Chinese text.
+
+python local/inference_new.py \
+    --task_name=cross-lingual_clone \
+    --model_name=paddle_checkpoint_dual_mask_enzh \
+    --uid=Prompt_003_new \
+    --new_str='今天天气很好.' \
+    --prefix='./prompt/dev/' \
+    --source_lang=english \
+    --target_lang=chinese \
+    --output_name=pred_clone.wav \
+    --voc=pwgan_aishell3 \
+    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
+    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
+    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
+    --am=fastspeech2_csmsc \
+    --am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \
+    --am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \
+    --am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \
+    --phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt
diff --git a/examples/ernie_sat/run_gen_en_new.sh b/examples/ernie_sat/run_gen_en_new.sh
new file mode 100755
index 0000000000000000000000000000000000000000..d76b004308c178f7d5891a13c6340d795f102ee7
--- /dev/null
+++ b/examples/ernie_sat/run_gen_en_new.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+# English-only speech synthesis
+# this example uses the speech of p299_096 ("This was not the show for me.")
+# as the prompt to synthesize 'I enjoy my life.'
+
+python local/inference_new.py \
+    --task_name=synthesize \
+    --model_name=paddle_checkpoint_en \
+    --uid=p299_096 \
+    --new_str='I enjoy my life, do you?'
 \
+    --prefix='./prompt/dev/' \
+    --source_lang=english \
+    --target_lang=english \
+    --output_name=pred_gen.wav \
+    --voc=pwgan_aishell3 \
+    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
+    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
+    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
+    --am=fastspeech2_ljspeech \
+    --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
+    --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
+    --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
+    --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
diff --git a/examples/ernie_sat/run_sedit_en_new.sh b/examples/ernie_sat/run_sedit_en_new.sh
new file mode 100755
index 0000000000000000000000000000000000000000..0952d280c728adc2dae046fb89df814d438057f1
--- /dev/null
+++ b/examples/ernie_sat/run_sedit_en_new.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+# English-only speech editing
+# this example edits the original speech of p243_new ("For that reason cover
+# should not be given.") into speech for 'for that reason cover is impossible to be given.'
+# NOTE: speech editing currently supports replacing or inserting text at only one position in a sentence
+
+python local/inference_new.py \
+    --task_name=edit \
+    --model_name=paddle_checkpoint_en \
+    --uid=p243_new \
+    --new_str='for that reason cover is impossible to be given.' \
+    --prefix='./prompt/dev/' \
+    --source_lang=english \
+    --target_lang=english \
+    --output_name=pred_edit.wav \
+    --voc=pwgan_aishell3 \
+    --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
+    --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
+    --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
+    --am=fastspeech2_ljspeech \
+    --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
+    --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
+    --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
+    --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
diff --git a/examples/ernie_sat/test_run_new.sh b/examples/ernie_sat/test_run_new.sh
new file mode 100755
index 0000000000000000000000000000000000000000..bf8a4e02db6afee902dbca3a06367a6c4d1695df
--- /dev/null
+++ b/examples/ernie_sat/test_run_new.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+rm -rf *.wav
+./run_sedit_en_new.sh          # speech editing task (English)
+./run_gen_en_new.sh            # personalized speech synthesis task (English)
+./run_clone_en_to_zh_new.sh    # cross-lingual speech synthesis task (English-to-Chinese voice cloning)
\ No newline at end of file
diff --git a/examples/iwslt2012/punc0/conf/default.yaml b/examples/iwslt2012/punc0/conf/default.yaml
index 74ced99327b38dff4863718b75a7bff8096061c1..e88ce2ff1a6e7fd1f5235ad6d41d2f26b69bf726 100644
--- a/examples/iwslt2012/punc0/conf/default.yaml
+++ b/examples/iwslt2012/punc0/conf/default.yaml
@@ -29,7 +29,7 @@ optimizer_params:
 
 scheduler_params:
   learning_rate: 1.0e-5           # learning rate.
-  gamma: 1.0                      # scheduler gamma.
+  gamma: 0.9999                   # scheduler gamma; must be in (0.0, 1.0), and the closer to 1.0 the better.
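+  # assuming an exponential-decay scheduler (lr_t = learning_rate * gamma^t):
+  # 0.9999^10000 ≈ 0.37, so the lr falls to ~37% of its initial value after
+  # 10k steps, whereas gamma=1.0 would never decay at all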
 
 ###########################################################
 #                     TRAINING SETTING                    #
diff --git a/examples/vctk/ernie_sat/conf/default.yaml b/examples/vctk/ernie_sat/conf/default.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..672f937ef50b1e808588595841ba6bcba45db00a
--- /dev/null
+++ b/examples/vctk/ernie_sat/conf/default.yaml
@@ -0,0 +1,163 @@
+###########################################################
+#                FEATURE EXTRACTION SETTING               #
+###########################################################
+
+fs: 24000           # sr
+n_fft: 2048         # FFT size (samples).
+n_shift: 300        # Hop size (samples). 12.5ms
+win_length: 1200    # Window length (samples). 50ms
+                    # If set to null, it will be the same as fft_size.
+window: "hann"      # Window function.
+
+# Only used for feats_type != raw
+
+fmin: 80            # Minimum frequency of Mel basis.
+fmax: 7600          # Maximum frequency of Mel basis.
+n_mels: 80          # The number of mel basis.
+
+mean_phn_span: 8
+mlm_prob: 0.8
+
+###########################################################
+#                       DATA SETTING                      #
+###########################################################
+batch_size: 20
+num_workers: 2
+
+###########################################################
+#                       MODEL SETTING                     #
+###########################################################
+model:
+    text_masking: false
+    postnet_layers: 5
+    postnet_filts: 5
+    postnet_chans: 256
+    encoder_type: conformer
+    decoder_type: conformer
+    enc_input_layer: sega_mlm
+    enc_pre_speech_layer: 0
+    enc_cnn_module_kernel: 7
+    enc_attention_dim: 384
+    enc_attention_heads: 2
+    enc_linear_units: 1536
+    enc_num_blocks: 4
+    enc_dropout_rate: 0.2
+    enc_positional_dropout_rate: 0.2
+    enc_attention_dropout_rate: 0.2
+    enc_normalize_before: true
+    enc_macaron_style: true
+    enc_use_cnn_module: true
+    enc_selfattention_layer_type: legacy_rel_selfattn
+    enc_activation_type: swish
+    enc_pos_enc_layer_type: legacy_rel_pos
+    enc_positionwise_layer_type: conv1d
+    enc_positionwise_conv_kernel_size: 3
+    dec_cnn_module_kernel: 31
+    dec_attention_dim: 384
+    dec_attention_heads: 2
+    dec_linear_units: 1536
+    dec_num_blocks: 4
+    dec_dropout_rate: 0.2
+    dec_positional_dropout_rate: 0.2
+    dec_attention_dropout_rate: 0.2
+    dec_macaron_style: true
+    dec_use_cnn_module: true
+    dec_selfattention_layer_type: legacy_rel_selfattn
+    dec_activation_type: swish
+    dec_pos_enc_layer_type: legacy_rel_pos
+    dec_positionwise_layer_type: conv1d
+    dec_positionwise_conv_kernel_size: 3
+
+###########################################################
+#                     OPTIMIZER SETTING                   #
+###########################################################
+scheduler_params:
+    d_model: 384
+    warmup_steps: 4000
+grad_clip: 1.0
+
+###########################################################
+#                     TRAINING SETTING                    #
+###########################################################
+max_epoch: 1500
+num_snapshots: 50
+
+###########################################################
+#                       OTHER SETTING                     #
+###########################################################
+seed: 0
+
+token_list:
+- <blank>
+- <unk>
+- AH0
+- T
+- N
+- sp
+- D
+- S
+- R
+- L
+- IH1
+- DH
+- AE1
+- M
+- EH1
+- K
+- Z
+- W
+- HH
+- ER0
+- AH1
+- IY1
+- P
+- V
+- F
+- B
+- AY1
+- IY0
+- EY1
+- AA1
+- AO1
+- UW1
+- IH0
+- OW1
+- NG
+- G
+- SH
+- ER1
+- Y
+- TH
+- AW1
+- CH
+- UH1
+- IH2
+- JH
+- OW0
+- EH2
+- OY1
+- AY2
+- EH0
+- EY2
+- UW0
+- AE2
+- AA2
+- OW2
+- AH2
+- ZH
+- AO2
+- IY2
+- AE0
+- UW2
+- AY0
+- AA0
+- AO0
+- AW2
+- EY0
+- UH2
+- ER2
+- OY2
+- UH0
+- AW0
+- OY0
+- <sos/eos>
diff --git a/examples/vctk/ernie_sat/local/preprocess.sh
b/examples/vctk/ernie_sat/local/preprocess.sh new file mode 100755 index 0000000000000000000000000000000000000000..a0a3881f0e7bd001c58f1802758a39cfd3050e1a --- /dev/null +++ b/examples/vctk/ernie_sat/local/preprocess.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./vctk_alignment \ + --output durations.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=vctk \ + --rootdir=~/datasets/VCTK-Corpus-0.92/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # normalize and covert phone/speaker to id, dev and test should use train's stats + echo "Normalize ..." + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/vctk/ernie_sat/local/synthesize.sh b/examples/vctk/ernie_sat/local/synthesize.sh new file mode 100755 index 0000000000000000000000000000000000000000..b24db018ac000aa7bbe1fd04b4d7c6fd5930eb5d --- /dev/null +++ b/examples/vctk/ernie_sat/local/synthesize.sh @@ -0,0 +1,45 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=1 +stop_stage=1 + +# use am to predict duration here +# 增加 am_phones_dict am_tones_dict 等,也可以用新的方式构造 am, 不需要这么多参数了就 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=pwgan_vctk \ + --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \ + --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \ + --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + 
--voc=hifigan_vctk \ + --voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/vctk/ernie_sat/local/train.sh b/examples/vctk/ernie_sat/local/train.sh new file mode 100755 index 0000000000000000000000000000000000000000..30720e8f5b7ed289be91765e42a6e05bdf279c4a --- /dev/null +++ b/examples/vctk/ernie_sat/local/train.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/vctk/ernie_sat/path.sh b/examples/vctk/ernie_sat/path.sh new file mode 100755 index 0000000000000000000000000000000000000000..4ecab02517d5a4bd59baf95bb5f536e263ce7ac0 --- /dev/null +++ b/examples/vctk/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=ernie_sat +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} \ No newline at end of file diff --git a/examples/vctk/ernie_sat/run.sh b/examples/vctk/ernie_sat/run.sh new file mode 100755 index 0000000000000000000000000000000000000000..d75a19f23481f152d8bb59212732d457f325b4ea --- /dev/null +++ b/examples/vctk/ernie_sat/run.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=0 +stop_stage=100 + +conf_path=conf/default.yaml +train_output_path=exp/default +ckpt_name=snapshot_iter_153.pdz + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this can not be mixed use with `$1`, `$2` ... +source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # prepare data + ./local/preprocess.sh ${conf_path} || exit -1 +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `train_output_path/checkpoints/` dir + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # synthesize, vocoder is pwgan + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi diff --git a/examples/vctk/tts3/conf/default.yaml b/examples/vctk/tts3/conf/default.yaml index 1bca9107b5eda75c17dd33656c092373f0758831..a75658d3d9a551a3e08b62dc6d1373609c43fcd6 100644 --- a/examples/vctk/tts3/conf/default.yaml +++ b/examples/vctk/tts3/conf/default.yaml @@ -24,7 +24,7 @@ f0max: 400 # Maximum f0 for pitch extraction. 
 #                       DATA SETTING                      #
 ###########################################################
 batch_size: 64
-num_workers: 4
+num_workers: 2
 
 
@@ -88,8 +88,8 @@ updater:
 #                     OPTIMIZER SETTING                   #
 ###########################################################
 optimizer:
-  optim: adam            # optimizer type
-  learning_rate: 0.001   # learning rate
+  optim: adam              # optimizer type
+  learning_rate: 0.001     # learning rate
 
 ###########################################################
 #                      TRAINING SETTING                   #
diff --git a/examples/zh_en_tts/tts3/README.md b/examples/zh_en_tts/tts3/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6d38181cc55e0ee265d40b93f9306c95884fb34d
--- /dev/null
+++ b/examples/zh_en_tts/tts3/README.md
@@ -0,0 +1,26 @@
+# Mixed Chinese and English TTS with FastSpeech2
+We trained a mixed Chinese-English fastspeech2 model (on CSMSC plus LJSpeech). The training code is still being sorted out, so for now this example only shows how to run inference with the pretrained models.
+The sample rate of the synthesized audio is 22050 Hz.
+
+## Download pretrained models
+Put pretrained models in a directory named `models`.
+
+- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
+- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
+
+```bash
+mkdir models
+cd models
+wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
+unzip fastspeech2_csmscljspeech_add-zhen.zip
+wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
+unzip hifigan_ljspeech_ckpt_0.2.0.zip
+cd ../
+```
+
+## Test
+You can choose `--spk_id` {0, 1} in `local/synthesize_e2e.sh`.
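+
+If you just want to hear the model quickly, a minimal sketch via the `paddlespeech` CLI should also work, since this PR registers `fastspeech2_mix` there (the input sentence below is just an illustration; any mixed zh/en text is fine, and the ljspeech vocoder is required for the mix acoustic model):
+
+```bash
+paddlespeech tts --am fastspeech2_mix --voc hifigan_ljspeech --lang mix --spk_id 0 \
+    --input "早上好 today is a good day." --output mix_demo.wav
+```
+
+To run the full example: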
+
+```bash
+bash test.sh
+```
diff --git a/examples/zh_en_tts/tts3/local/synthesize_e2e.sh b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh
new file mode 100755
index 0000000000000000000000000000000000000000..a206c3a84ace7e21fc1e1fde0ac1654a9070fe53
--- /dev/null
+++ b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+model_dir=$1
+output=$2
+am_name=fastspeech2_csmscljspeech_add-zhen
+am_model_dir=${model_dir}/${am_name}/
+
+stage=1
+stop_stage=1
+
+
+# hifigan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize_e2e.py \
+        --am=fastspeech2_mix \
+        --am_config=${am_model_dir}/default.yaml \
+        --am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \
+        --am_stat=${am_model_dir}/speech_stats.npy \
+        --voc=hifigan_ljspeech \
+        --voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \
+        --voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
+        --voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
+        --lang=mix \
+        --text=${BIN_DIR}/../sentences_mix.txt \
+        --output_dir=${output}/test_e2e \
+        --phones_dict=${am_model_dir}/phone_id_map.txt \
+        --speaker_dict=${am_model_dir}/speaker_id_map.txt \
+        --spk_id 0
+fi
diff --git a/examples/zh_en_tts/tts3/path.sh b/examples/zh_en_tts/tts3/path.sh
new file mode 100755
index 0000000000000000000000000000000000000000..fb7e8411c80cc8cbf1c65dffaaf771bda961e10e
--- /dev/null
+++ b/examples/zh_en_tts/tts3/path.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+export MAIN_ROOT=`realpath ${PWD}/../../../`
+
+export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
+export LC_ALL=C
+
+export PYTHONDONTWRITEBYTECODE=1
+# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
+
+MODEL=fastspeech2
+export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
diff --git a/examples/zh_en_tts/tts3/test.sh b/examples/zh_en_tts/tts3/test.sh
new file mode 100755
index 0000000000000000000000000000000000000000..ff34da14ab651bb1d1118a801674b3db1a6a5fbd
--- /dev/null
+++ b/examples/zh_en_tts/tts3/test.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+gpus=0,1
+stage=3
+stop_stage=100
+
+model_dir=models
+output_dir=output
+
+# with the following command, you can choose the stage range you want to run
+# such as `./test.sh --stage 3 --stop-stage 3`
+# this cannot be mixed with `$1`, `$2` ...
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # synthesize_e2e, vocoder is hifigan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1 +fi + diff --git a/paddlespeech/cli/tts/infer.py b/paddlespeech/cli/tts/infer.py index ade8cdd6dc5f4f255a582b40fe6a7aa336b04fa0..65750e1a8f4dd76f4b8c814a2d15c13da6d766ed 100644 --- a/paddlespeech/cli/tts/infer.py +++ b/paddlespeech/cli/tts/infer.py @@ -29,8 +29,7 @@ from yacs.config import CfgNode from ..executor import BaseExecutor from ..log import logger from ..utils import stats_wrapper -from paddlespeech.t2s.frontend import English -from paddlespeech.t2s.frontend.zh_frontend import Frontend +from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.modules.normalizer import ZScore __all__ = ['TTSExecutor'] @@ -54,6 +53,7 @@ class TTSExecutor(BaseExecutor): 'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk', + 'fastspeech2_mix', 'tacotron2_csmsc', 'tacotron2_ljspeech', ], @@ -98,7 +98,7 @@ class TTSExecutor(BaseExecutor): self.parser.add_argument( '--voc', type=str, - default='pwgan_csmsc', + default='hifigan_csmsc', choices=[ 'pwgan_csmsc', 'pwgan_ljspeech', @@ -135,7 +135,7 @@ class TTSExecutor(BaseExecutor): '--lang', type=str, default='zh', - help='Choose model language. zh or en') + help='Choose model language. zh or en or mix') self.parser.add_argument( '--device', type=str, @@ -231,8 +231,11 @@ class TTSExecutor(BaseExecutor): use_pretrained_voc = True else: use_pretrained_voc = False - - voc_tag = voc + '-' + lang + voc_lang = lang + # we must use ljspeech's voc for mix am now! + if lang == 'mix': + voc_lang = 'en' + voc_tag = voc + '-' + voc_lang self.task_resource.set_task_model( model_tag=voc_tag, model_type=1, # vocoder @@ -281,13 +284,8 @@ class TTSExecutor(BaseExecutor): spk_num = len(spk_id) # frontend - if lang == 'zh': - self.frontend = Frontend( - phone_vocab_path=self.phones_dict, - tone_vocab_path=self.tones_dict) - - elif lang == 'en': - self.frontend = English(phone_vocab_path=self.phones_dict) + self.frontend = get_frontend( + lang=lang, phones_dict=self.phones_dict, tones_dict=self.tones_dict) # acoustic model odim = self.am_config.n_mels @@ -381,8 +379,12 @@ class TTSExecutor(BaseExecutor): input_ids = self.frontend.get_input_ids( text, merge_sentences=merge_sentences) phone_ids = input_ids["phone_ids"] + elif lang == 'mix': + input_ids = self.frontend.get_input_ids( + text, merge_sentences=merge_sentences) + phone_ids = input_ids["phone_ids"] else: - logger.error("lang should in {'zh', 'en'}!") + logger.error("lang should in {'zh', 'en', 'mix'}!") self.frontend_time = time.time() - frontend_st self.am_time = 0 @@ -398,7 +400,7 @@ class TTSExecutor(BaseExecutor): # fastspeech2 else: # multi speaker - if am_dataset in {"aishell3", "vctk"}: + if am_dataset in {'aishell3', 'vctk', 'mix'}: mel = self.am_inference( part_phone_ids, spk_id=paddle.to_tensor(spk_id)) else: diff --git a/paddlespeech/resource/pretrained_models.py b/paddlespeech/resource/pretrained_models.py index 324bd3aea2ad83267d50b7647facfcc5bee27523..d7df0e48a38602bbb0e4ff2870980fb99ed7da99 100644 --- a/paddlespeech/resource/pretrained_models.py +++ b/paddlespeech/resource/pretrained_models.py @@ -655,6 +655,24 @@ tts_dynamic_pretrained_models = { 'phone_id_map.txt', }, }, + "fastspeech2_mix-mix": { + '1.0': { + 'url': + 
'https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip', + 'md5': + '77d9d4b5a79ed6203339ead7ef6c74f9', + 'config': + 'default.yaml', + 'ckpt': + 'snapshot_iter_94000.pdz', + 'speech_stats': + 'speech_stats.npy', + 'phones_dict': + 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', + }, + }, # tacotron2 "tacotron2_csmsc-zh": { '1.0': { diff --git a/paddlespeech/s2t/models/u2/u2.py b/paddlespeech/s2t/models/u2/u2.py index 432162aaebbf83a8dfa79acad2d53f96110b1fca..ca83ca17046dd016d76c8f9bf873c711e3448635 100644 --- a/paddlespeech/s2t/models/u2/u2.py +++ b/paddlespeech/s2t/models/u2/u2.py @@ -630,13 +630,11 @@ class U2BaseModel(ASRInterface, nn.Layer): (elayers, head, cache_t1, d_k * 2), where `head * d_k == hidden-dim` and `cache_t1 == chunk_size * num_decoding_left_chunks`. - `d_k * 2` for att key & value. Default is 0-dims Tensor, - it is used for dy2st. + `d_k * 2` for att key & value. cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer, (elayers, b=1, hidden-dim, cache_t2), where - `cache_t2 == cnn.lorder - 1`. Default is 0-dims Tensor, - it is used for dy2st. - + `cache_t2 == cnn.lorder - 1`. + Returns: paddle.Tensor: output of current input xs, with shape (b=1, chunk_size, hidden-dim). diff --git a/paddlespeech/s2t/modules/encoder_layer.py b/paddlespeech/s2t/modules/encoder_layer.py index 9e46cc54045d0edbe87847f73dbffc21ae219408..5f810dfde98d240245dfd2fa8e89cb17f4f84094 100644 --- a/paddlespeech/s2t/modules/encoder_layer.py +++ b/paddlespeech/s2t/modules/encoder_layer.py @@ -76,9 +76,9 @@ class TransformerEncoderLayer(nn.Layer): x: paddle.Tensor, mask: paddle.Tensor, pos_emb: paddle.Tensor, - mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool), - att_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), - cnn_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), + mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool), + att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: """Compute encoded features. Args: @@ -105,9 +105,7 @@ class TransformerEncoderLayer(nn.Layer): if self.normalize_before: x = self.norm1(x) - x_att, new_att_cache = self.self_attn( - x, x, x, mask, cache=att_cache - ) + x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache) if self.concat_after: x_concat = paddle.concat((x, x_att), axis=-1) @@ -124,7 +122,7 @@ class TransformerEncoderLayer(nn.Layer): if not self.normalize_before: x = self.norm2(x) - fake_cnn_cache = paddle.zeros([0,0,0], dtype=x.dtype) + fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype) return x, mask, new_att_cache, fake_cnn_cache @@ -195,9 +193,9 @@ class ConformerEncoderLayer(nn.Layer): x: paddle.Tensor, mask: paddle.Tensor, pos_emb: paddle.Tensor, - mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool), - att_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), - cnn_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), + mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool), + att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: """Compute encoded features. Args: @@ -211,7 +209,8 @@ class ConformerEncoderLayer(nn.Layer): att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE (#batch=1, head, cache_t1, d_k * 2), head * d_k == size. 
             cnn_cache (paddle.Tensor): Convolution cache in conformer layer
-                (#batch=1, size, cache_t2)
+                (1, #batch=1, size, cache_t2). First dim will not be used, just
+                for dy2st.
         Returns:
             paddle.Tensor: Output tensor (#batch, time, size).
             paddle.Tensor: Mask tensor (#batch, time, time).
@@ -219,6 +218,8 @@
             (#batch=1, head, cache_t1 + time, d_k * 2).
             paddle.Tensor: cnn_cahce tensor (#batch, size, cache_t2).
         """
+        # (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
+        cnn_cache = paddle.squeeze(cnn_cache, axis=0)
 
         # whether to use macaron style FFN
         if self.feed_forward_macaron is not None:
@@ -249,8 +250,7 @@
 
         # convolution module
         # Fake new cnn cache here, and then change it in conv_module
-        new_cnn_cache = paddle.zeros([0,0,0], dtype=x.dtype)
-        cnn_cache = paddle.squeeze(cnn_cache, axis=0)
+        new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
         if self.conv_module is not None:
             residual = x
             if self.normalize_before:
@@ -275,4 +275,4 @@
 
         if self.conv_module is not None:
             x = self.norm_final(x)
 
-        return x, mask, new_att_cache, new_cnn_cache
\ No newline at end of file
+        return x, mask, new_att_cache, new_cnn_cache
diff --git a/paddlespeech/server/engine/engine_warmup.py b/paddlespeech/server/engine/engine_warmup.py
index 12c760c6f61ccb8c67a06c79b42b75a6d108cbeb..3751554c21339e9427f11f6dbbfb142dac2cb280 100644
--- a/paddlespeech/server/engine/engine_warmup.py
+++ b/paddlespeech/server/engine/engine_warmup.py
@@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
         else:
             st = time.time()
-            connection_handler.infer(text=sentence)
+            connection_handler.infer(
+                text=sentence,
+                lang=tts_engine.lang,
+                am=tts_engine.config.am)
             et = time.time()
             logger.debug(
                 f"The response time of the {i} warm up: {et - st} s")
diff --git a/paddlespeech/t2s/datasets/am_batch_fn.py b/paddlespeech/t2s/datasets/am_batch_fn.py
index 1c70b1cdce81f50de80e8a9bf4e1411a114d48a1..2cb7a11a22b02eb25da3c38d6dfdc56d64eb4c25 100644
--- a/paddlespeech/t2s/datasets/am_batch_fn.py
+++ b/paddlespeech/t2s/datasets/am_batch_fn.py
@@ -28,6 +28,150 @@ from paddlespeech.t2s.modules.nets_utils import phones_masking
 from paddlespeech.t2s.modules.nets_utils import phones_text_masking
 
 
+# built as a separate factory because extra arguments need to be passed in
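+# a usage sketch (names here are hypothetical; 0.8 / 8 / 20 follow the values
+# in the ERNIE-SAT conf/default.yaml):
+#   collate_fn = build_erniesat_collate_fn(mlm_prob=0.8, mean_phn_span=8)
+#   loader = DataLoader(train_dataset, batch_size=20, collate_fn=collate_fn)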
+def build_erniesat_collate_fn(mlm_prob: float=0.8,
+                              mean_phn_span: int=8,
+                              seg_emb: bool=False,
+                              text_masking: bool=False):
+
+    return ErnieSATCollateFn(
+        mlm_prob=mlm_prob,
+        mean_phn_span=mean_phn_span,
+        seg_emb=seg_emb,
+        text_masking=text_masking)
+
+
+class ErnieSATCollateFn:
+    """Functor class of erniesat_batch_fn()"""
+
+    def __init__(self,
+                 mlm_prob: float=0.8,
+                 mean_phn_span: int=8,
+                 seg_emb: bool=False,
+                 text_masking: bool=False):
+        self.mlm_prob = mlm_prob
+        self.mean_phn_span = mean_phn_span
+        self.seg_emb = seg_emb
+        self.text_masking = text_masking
+
+    def __call__(self, examples):
+        return erniesat_batch_fn(
+            examples,
+            mlm_prob=self.mlm_prob,
+            mean_phn_span=self.mean_phn_span,
+            seg_emb=self.seg_emb,
+            text_masking=self.text_masking)
+
+
+def erniesat_batch_fn(examples,
+                      mlm_prob: float=0.8,
+                      mean_phn_span: int=8,
+                      seg_emb: bool=False,
+                      text_masking: bool=False):
+    # fields = ["text", "text_lengths", "speech", "speech_lengths", "align_start", "align_end"]
+    text = [np.array(item["text"], dtype=np.int64) for item in examples]
+    speech = [np.array(item["speech"], dtype=np.float32) for item in examples]
+
+    text_lengths = [
+        np.array(item["text_lengths"], dtype=np.int64) for item in examples
+    ]
+    speech_lengths = [
+        np.array(item["speech_lengths"], dtype=np.int64) for item in examples
+    ]
+
+    align_start = [
+        np.array(item["align_start"], dtype=np.int64) for item in examples
+    ]
+
+    align_end = [
+        np.array(item["align_end"], dtype=np.int64) for item in examples
+    ]
+
+    align_start_lengths = [
+        np.array(len(item["align_start"]), dtype=np.int64) for item in examples
+    ]
+
+    # add_pad
+    text = batch_sequences(text)
+    speech = batch_sequences(speech)
+    align_start = batch_sequences(align_start)
+    align_end = batch_sequences(align_end)
+
+    # convert each batch to paddle.Tensor
+    text = paddle.to_tensor(text)
+    speech = paddle.to_tensor(speech)
+    text_lengths = paddle.to_tensor(text_lengths)
+    speech_lengths = paddle.to_tensor(speech_lengths)
+    align_start_lengths = paddle.to_tensor(align_start_lengths)
+
+    speech_pad = speech
+    text_pad = text
+
+    text_mask = make_non_pad_mask(
+        text_lengths, text_pad, length_dim=1).unsqueeze(-2)
+    speech_mask = make_non_pad_mask(
+        speech_lengths, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2)
+
+    # for training
+    span_bdy = None
+    # for inference
+    if 'span_bdy' in examples[0].keys():
+        span_bdy = [
+            np.array(item["span_bdy"], dtype=np.int64) for item in examples
+        ]
+        span_bdy = paddle.to_tensor(span_bdy)
+
+    # dual_mask: for mixed zh/en data, mask speech and text at the same time
+    # ERNIE-SAT masks both when doing the cross-lingual task
+    if text_masking:
+        masked_pos, text_masked_pos = phones_text_masking(
+            xs_pad=speech_pad,
+            src_mask=speech_mask,
+            text_pad=text_pad,
+            text_mask=text_mask,
+            align_start=align_start,
+            align_end=align_end,
+            align_start_lens=align_start_lengths,
+            mlm_prob=mlm_prob,
+            mean_phn_span=mean_phn_span,
+            span_bdy=span_bdy)
+    # for pure-Chinese / pure-English training -> A3T does not mask phonemes, only speech
+    # the main difference between A3T and ERNIE-SAT lies in how masking is done
+    else:
+        masked_pos = phones_masking(
+            xs_pad=speech_pad,
+            src_mask=speech_mask,
+            align_start=align_start,
+            align_end=align_end,
+            align_start_lens=align_start_lengths,
+            mlm_prob=mlm_prob,
+            mean_phn_span=mean_phn_span,
+            span_bdy=span_bdy)
+        text_masked_pos = paddle.zeros(paddle.shape(text_pad))
+
+    speech_seg_pos, text_seg_pos = get_seg_pos(
+        speech_pad=speech_pad,
+        text_pad=text_pad,
+        align_start=align_start,
+        align_end=align_end,
+        align_start_lens=align_start_lengths,
+        seg_emb=seg_emb)
+
+    batch = {
+        "text": text,
+        "speech": speech,
+        # need to generate
+        "masked_pos": masked_pos,
+        "speech_mask": speech_mask,
+        "text_mask": text_mask,
+        "speech_seg_pos": speech_seg_pos,
+        "text_seg_pos": text_seg_pos,
+        "text_masked_pos": text_masked_pos
+    }
+
+    return batch
+
+
 def tacotron2_single_spk_batch_fn(examples):
     # fields = ["text", "text_lengths", "speech", "speech_lengths"]
     text = [np.array(item["text"], dtype=np.int64) for item in examples]
@@ -378,7 +522,6 @@ class MLMCollateFn:
             mean_phn_span=self.mean_phn_span,
             seg_emb=self.seg_emb,
             text_masking=self.text_masking,
-            attention_window=self.attention_window,
             not_sequence=self.not_sequence)
 
 
@@ -389,7 +532,6 @@ def mlm_collate_fn(
         mean_phn_span: int=8,
         seg_emb: bool=False,
         text_masking: bool=False,
-        attention_window: int=0,
         pad_value: int=0,
         not_sequence: Collection[str]=(), ) -> Tuple[List[str], Dict[str, paddle.Tensor]]:
@@ -439,6 +582,7 @@
         text_lens, text_pad, length_dim=1).unsqueeze(-2)
     speech_mask = 
make_non_pad_mask( feats_lens, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2) + span_bdy = None if 'span_bdy' in output.keys(): span_bdy = output['span_bdy'] diff --git a/paddlespeech/t2s/exps/ernie_sat/__init__.py b/paddlespeech/t2s/exps/ernie_sat/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/paddlespeech/t2s/exps/ernie_sat/align.py b/paddlespeech/t2s/exps/ernie_sat/align.py new file mode 100755 index 0000000000000000000000000000000000000000..529a8221c11de2509928d202db9a0b1f89551dea --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/align.py @@ -0,0 +1,386 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import shutil +from pathlib import Path + +import librosa +import numpy as np +import pypinyin +from praatio import textgrid +from paddlespeech.t2s.exps.ernie_sat.utils import get_tmp_name +from paddlespeech.t2s.exps.ernie_sat.utils import get_dict + + +DICT_EN = 'tools/aligner/cmudict-0.7b' +DICT_ZH = 'tools/aligner/simple.lexicon' +MODEL_DIR_EN = 'tools/aligner/vctk_model.zip' +MODEL_DIR_ZH = 'tools/aligner/aishell3_model.zip' +MFA_PATH = 'tools/montreal-forced-aligner/bin' +os.environ['PATH'] = MFA_PATH + '/:' + os.environ['PATH'] + +def _get_max_idx(dic): + return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1] + + +def _readtg(tg_path: str, lang: str='en', fs: int=24000, n_shift: int=300): + alignment = textgrid.openTextgrid(tg_path, includeEmptyIntervals=True) + phones = [] + ends = [] + words = [] + + for interval in alignment.tierDict['words'].entryList: + word = interval.label + if word: + words.append(word) + for interval in alignment.tierDict['phones'].entryList: + phone = interval.label + phones.append(phone) + ends.append(interval.end) + frame_pos = librosa.time_to_frames(ends, sr=fs, hop_length=n_shift) + durations = np.diff(frame_pos, prepend=0) + assert len(durations) == len(phones) + # merge '' and sp in the end + if phones[-1] == '' and len(phones) > 1 and phones[-2] == 'sp': + phones = phones[:-1] + durations[-2] += durations[-1] + durations = durations[:-1] + + # replace ' and 'sil' with 'sp' + phones = ['sp' if (phn == '' or phn == 'sil') else phn for phn in phones] + + if lang == 'en': + DICT = DICT_EN + + elif lang == 'zh': + DICT = DICT_ZH + + word2phns_dict = get_dict(DICT) + + phn2word_dict = [] + for word in words: + if lang == 'en': + word = word.upper() + phn2word_dict.append([word2phns_dict[word].split(), word]) + + non_sp_idx = 0 + word_idx = 0 + i = 0 + word2phns = {} + while i < len(phones): + phn = phones[i] + if phn == 'sp': + word2phns[str(word_idx) + '_sp'] = ['sp'] + i += 1 + else: + phns, word = phn2word_dict[non_sp_idx] + word2phns[str(word_idx) + '_' + word] = phns + non_sp_idx += 1 + i += len(phns) + word_idx += 1 + sum_phn = sum(len(word2phns[k]) for k in word2phns) + assert sum_phn == len(phones) + + results = '' + for (p, d) in zip(phones, durations): + results += p + ' 
' + str(d) + ' ' + return results.strip(), word2phns + + +def alignment(wav_path: str, + text: str, + fs: int=24000, + lang='en', + n_shift: int=300): + wav_name = os.path.basename(wav_path) + utt = wav_name.split('.')[0] + # prepare data for MFA + tmp_name = get_tmp_name(text=text) + tmpbase = './tmp_dir/' + tmp_name + tmpbase = Path(tmpbase) + tmpbase.mkdir(parents=True, exist_ok=True) + print("tmp_name in alignment:",tmp_name) + + shutil.copyfile(wav_path, tmpbase / wav_name) + txt_name = utt + '.txt' + txt_path = tmpbase / txt_name + with open(txt_path, 'w') as wf: + wf.write(text + '\n') + # MFA + if lang == 'en': + DICT = DICT_EN + MODEL_DIR = MODEL_DIR_EN + + elif lang == 'zh': + DICT = DICT_ZH + MODEL_DIR = MODEL_DIR_ZH + else: + print('please input right lang!!') + + CMD = 'mfa_align' + ' ' + str( + tmpbase) + ' ' + DICT + ' ' + MODEL_DIR + ' ' + str(tmpbase) + os.system(CMD) + tg_path = str(tmpbase) + '/' + tmp_name + '/' + utt + '.TextGrid' + phn_dur, word2phns = _readtg(tg_path, lang=lang) + phn_dur = phn_dur.split() + phns = phn_dur[::2] + durs = phn_dur[1::2] + durs = [int(d) for d in durs] + assert len(phns) == len(durs) + return phns, durs, word2phns + + +def words2phns(text: str, lang='en'): + ''' + Args: + text (str): + input text. + eg: for that reason cover is impossible to be given. + lang (str): + 'en' or 'zh' + Returns: + List[str]: phones of input text. + eg: + ['F', 'AO1', 'R', 'DH', 'AE1', 'T', 'R', 'IY1', 'Z', 'AH0', 'N', 'K', 'AH1', 'V', 'ER0', + 'IH1', 'Z', 'IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L', 'T', 'UW1', 'B', 'IY1', + 'G', 'IH1', 'V', 'AH0', 'N'] + + Dict(str, str): key - idx_word + value - phones + eg: + {'0_FOR': ['F', 'AO1', 'R'], '1_THAT': ['DH', 'AE1', 'T'], + '2_REASON': ['R', 'IY1', 'Z', 'AH0', 'N'],'3_COVER': ['K', 'AH1', 'V', 'ER0'], '4_IS': ['IH1', 'Z'], + '5_IMPOSSIBLE': ['IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L'], + '6_TO': ['T', 'UW1'], '7_BE': ['B', 'IY1'], '8_GIVEN': ['G', 'IH1', 'V', 'AH0', 'N']} + ''' + text = text.strip() + words = [] + for pun in [ + ',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---', u',', + u'。', u':', u';', u'!', u'?', u'(', u')' + ]: + text = text.replace(pun, ' ') + for wrd in text.split(): + if (wrd[-1] == '-'): + wrd = wrd[:-1] + if (wrd[0] == "'"): + wrd = wrd[1:] + if wrd: + words.append(wrd) + if lang == 'en': + dictfile = DICT_EN + elif lang == 'zh': + dictfile = DICT_ZH + else: + print('please input right lang!!') + + word2phns_dict = get_dict(dictfile) + ds = word2phns_dict.keys() + phns = [] + wrd2phns = {} + for index, wrd in enumerate(words): + if lang == 'en': + wrd = wrd.upper() + if (wrd not in ds): + wrd2phns[str(index) + '_' + wrd] = 'spn' + phns.extend('spn') + else: + wrd2phns[str(index) + '_' + wrd] = word2phns_dict[wrd].split() + phns.extend(word2phns_dict[wrd].split()) + return phns, wrd2phns + + +def get_phns_spans(wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + fs: int=24000, + n_shift: int=300): + is_append = (old_str == new_str[:len(old_str)]) + old_phns, mfa_start, mfa_end = [], [], [] + # source + lang = source_lang + phn, dur, w2p = alignment( + wav_path=wav_path, text=old_str, lang=lang, fs=fs, n_shift=n_shift) + + new_d_cumsum = np.pad(np.array(dur).cumsum(0), (1, 0), 'constant').tolist() + mfa_start = new_d_cumsum[:-1] + mfa_end = new_d_cumsum[1:] + old_phns = phn + + # target + if is_append and (source_lang != target_lang): + cross_lingual_clone = True + else: + cross_lingual_clone = False + + if 
cross_lingual_clone: + str_origin = new_str[:len(old_str)] + str_append = new_str[len(old_str):] + + if target_lang == 'zh': + phns_origin, origin_w2p = words2phns(str_origin, lang='en') + phns_append, append_w2p_tmp = words2phns(str_append, lang='zh') + elif target_lang == 'en': + # 原始句子 + phns_origin, origin_w2p = words2phns(str_origin, lang='zh') + # clone 句子 + phns_append, append_w2p_tmp = words2phns(str_append, lang='en') + else: + assert target_lang == 'zh' or target_lang == 'en', \ + 'cloning is not support for this language, please check it.' + + new_phns = phns_origin + phns_append + + append_w2p = {} + length = len(origin_w2p) + for key, value in append_w2p_tmp.items(): + idx, wrd = key.split('_') + append_w2p[str(int(idx) + length) + '_' + wrd] = value + new_w2p = origin_w2p.copy() + new_w2p.update(append_w2p) + + else: + if source_lang == target_lang: + new_phns, new_w2p = words2phns(new_str, lang=source_lang) + else: + assert source_lang == target_lang, \ + 'source language is not same with target language...' + + span_to_repl = [0, len(old_phns) - 1] + span_to_add = [0, len(new_phns) - 1] + left_idx = 0 + new_phns_left = [] + sp_count = 0 + # find the left different index + # 因为可能 align 时候的 words2phns 和直接 words2phns, 前者会有 sp? + for key in w2p.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_left.append('sp') + else: + idx = str(int(idx) - sp_count) + if idx + '_' + wrd in new_w2p: + # 是 new_str phn 序列的 index + left_idx += len(new_w2p[idx + '_' + wrd]) + # old phn 序列 + new_phns_left.extend(w2p[key]) + else: + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + break + + # reverse w2p and new_w2p + right_idx = 0 + new_phns_right = [] + sp_count = 0 + w2p_max_idx = _get_max_idx(w2p) + new_w2p_max_idx = _get_max_idx(new_w2p) + new_phns_mid = [] + if is_append: + new_phns_right = [] + new_phns_mid = new_phns[left_idx:] + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + span_to_repl[1] = len(old_phns) - len(new_phns_right) + # speech edit + else: + for key in list(w2p.keys())[::-1]: + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_right = ['sp'] + new_phns_right + else: + idx = str(new_w2p_max_idx - (w2p_max_idx - int(idx) - sp_count)) + if idx + '_' + wrd in new_w2p: + right_idx -= len(new_w2p[idx + '_' + wrd]) + new_phns_right = w2p[key] + new_phns_right + else: + span_to_repl[1] = len(old_phns) - len(new_phns_right) + new_phns_mid = new_phns[left_idx:right_idx] + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + if len(new_phns_mid) == 0: + span_to_add[1] = min(span_to_add[1] + 1, len(new_phns)) + span_to_add[0] = max(0, span_to_add[0] - 1) + span_to_repl[0] = max(0, span_to_repl[0] - 1) + span_to_repl[1] = min(span_to_repl[1] + 1, + len(old_phns)) + break + new_phns = new_phns_left + new_phns_mid + new_phns_right + ''' + For that reason cover should not be given. + For that reason cover is impossible to be given. + span_to_repl: [17, 23] "should not" + span_to_add: [17, 30] "is impossible to" + ''' + outs = {} + outs['mfa_start'] = mfa_start + outs['mfa_end'] = mfa_end + outs['old_phns'] = old_phns + outs['new_phns'] = new_phns + outs['span_to_repl'] = span_to_repl + outs['span_to_add'] = span_to_add + + return outs + + +if __name__ == '__main__': + text = "For that reason cover should not be given." 
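+    # Illustrative outputs only (exact values depend on the installed MFA
+    # model and dictionaries under tools/):
+    #   phn       -> ['F', 'AO1', 'R', 'DH', 'AE1', 'T', ...]  per-phone labels
+    #   dur       -> [3, 7, 5, ...]  frame counts at fs=24000, n_shift=300
+    #   word2phns -> {'0_FOR': ['F', 'AO1', 'R'], '1_THAT': [...], ...}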
+ phn, dur, word2phns = alignment("exp/p243_313.wav", text, lang='en') + print(phn, dur) + print(word2phns) + print("---------------------------------") + # 这里可以用我们的中文前端得到 pinyin 序列 + text_zh = "卡尔普陪外孙玩滑梯。" + text_zh = pypinyin.lazy_pinyin( + text_zh, + neutral_tone_with_five=True, + style=pypinyin.Style.TONE3, + tone_sandhi=True) + text_zh = " ".join(text_zh) + phn, dur, word2phns = alignment("exp/000001.wav", text_zh, lang='zh') + print(phn, dur) + print(word2phns) + print("---------------------------------") + phns, wrd2phns = words2phns(text, lang='en') + print("phns:", phns) + print("wrd2phns:", wrd2phns) + print("---------------------------------") + + phns, wrd2phns = words2phns(text_zh, lang='zh') + print("phns:", phns) + print("wrd2phns:", wrd2phns) + print("---------------------------------") + + outs = get_phns_spans( + wav_path="exp/p243_313.wav", + old_str="For that reason cover should not be given.", + new_str="for that reason cover is impossible to be given.") + + mfa_start = outs["mfa_start"] + mfa_end = outs["mfa_end"] + old_phns = outs["old_phns"] + new_phns = outs["new_phns"] + span_to_repl = outs["span_to_repl"] + span_to_add = outs["span_to_add"] + print("mfa_start:", mfa_start) + print("mfa_end:", mfa_end) + print("old_phns:", old_phns) + print("new_phns:", new_phns) + print("span_to_repl:", span_to_repl) + print("span_to_add:", span_to_add) + print("---------------------------------") diff --git a/paddlespeech/t2s/exps/ernie_sat/normalize.py b/paddlespeech/t2s/exps/ernie_sat/normalize.py new file mode 100644 index 0000000000000000000000000000000000000000..74cdae2a6b85b5b5ce73580e48941fd8d117af1f --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/normalize.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Normalize feature files and dump them.""" +import argparse +import logging +from operator import itemgetter +from pathlib import Path + +import jsonlines +import numpy as np +from sklearn.preprocessing import StandardScaler +from tqdm import tqdm + +from paddlespeech.t2s.datasets.data_table import DataTable + + +def main(): + """Run preprocessing process.""" + parser = argparse.ArgumentParser( + description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)." + ) + parser.add_argument( + "--metadata", + type=str, + required=True, + help="directory including feature files to be normalized. 
" + "you need to specify either *-scp or rootdir.") + + parser.add_argument( + "--dumpdir", + type=str, + required=True, + help="directory to dump normalized feature files.") + parser.add_argument( + "--speech-stats", + type=str, + required=True, + help="speech statistics file.") + parser.add_argument( + "--phones-dict", type=str, default=None, help="phone vocabulary file.") + parser.add_argument( + "--speaker-dict", type=str, default=None, help="speaker id map file.") + + args = parser.parse_args() + + dumpdir = Path(args.dumpdir).expanduser() + # use absolute path + dumpdir = dumpdir.resolve() + dumpdir.mkdir(parents=True, exist_ok=True) + + # get dataset + with jsonlines.open(args.metadata, 'r') as reader: + metadata = list(reader) + dataset = DataTable( + metadata, converters={ + "speech": np.load, + }) + logging.info(f"The number of files = {len(dataset)}.") + + # restore scaler + speech_scaler = StandardScaler() + speech_scaler.mean_ = np.load(args.speech_stats)[0] + speech_scaler.scale_ = np.load(args.speech_stats)[1] + speech_scaler.n_features_in_ = speech_scaler.mean_.shape[0] + + vocab_phones = {} + with open(args.phones_dict, 'rt') as f: + phn_id = [line.strip().split() for line in f.readlines()] + for phn, id in phn_id: + vocab_phones[phn] = int(id) + + vocab_speaker = {} + with open(args.speaker_dict, 'rt') as f: + spk_id = [line.strip().split() for line in f.readlines()] + for spk, id in spk_id: + vocab_speaker[spk] = int(id) + + # process each file + output_metadata = [] + + for item in tqdm(dataset): + utt_id = item['utt_id'] + speech = item['speech'] + + # normalize + speech = speech_scaler.transform(speech) + speech_dir = dumpdir / "data_speech" + speech_dir.mkdir(parents=True, exist_ok=True) + speech_path = speech_dir / f"{utt_id}_speech.npy" + np.save(speech_path, speech.astype(np.float32), allow_pickle=False) + + phone_ids = [vocab_phones[p] for p in item['phones']] + spk_id = vocab_speaker[item["speaker"]] + record = { + "utt_id": item['utt_id'], + "spk_id": spk_id, + "text": phone_ids, + "text_lengths": item['text_lengths'], + "speech_lengths": item['speech_lengths'], + "durations": item['durations'], + "speech": str(speech_path), + "align_start": item['align_start'], + "align_end": item['align_end'], + } + # add spk_emb for voice cloning + if "spk_emb" in item: + record["spk_emb"] = str(item["spk_emb"]) + + output_metadata.append(record) + output_metadata.sort(key=itemgetter('utt_id')) + output_metadata_path = Path(args.dumpdir) / "metadata.jsonl" + with jsonlines.open(output_metadata_path, 'w') as writer: + for item in output_metadata: + writer.write(item) + logging.info(f"metadata dumped into {output_metadata_path}") + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/preprocess.py b/paddlespeech/t2s/exps/ernie_sat/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..fc9e0888b262d79ddba96ebaa095feeb1b1ab315 --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/preprocess.py @@ -0,0 +1,342 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from concurrent.futures import ThreadPoolExecutor +from operator import itemgetter +from pathlib import Path +from typing import Any +from typing import Dict +from typing import List + +import jsonlines +import librosa +import numpy as np +import tqdm +import yaml +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.get_feats import LogMelFBank +from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length +from paddlespeech.t2s.datasets.preprocess_utils import get_input_token +from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur +from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map +from paddlespeech.t2s.datasets.preprocess_utils import merge_silence +from paddlespeech.t2s.utils import str2bool + + +def process_sentence(config: Dict[str, Any], + fp: Path, + sentences: Dict, + output_dir: Path, + mel_extractor=None, + cut_sil: bool=True, + spk_emb_dir: Path=None): + utt_id = fp.stem + # for vctk + if utt_id.endswith("_mic2"): + utt_id = utt_id[:-5] + record = None + if utt_id in sentences: + # reading, resampling may occur + wav, _ = librosa.load(str(fp), sr=config.fs) + if len(wav.shape) != 1: + return record + max_value = np.abs(wav).max() + if max_value > 1.0: + wav = wav / max_value + assert len(wav.shape) == 1, f"{utt_id} is not a mono-channel audio." + assert np.abs(wav).max( + ) <= 1.0, f"{utt_id} is seems to be different that 16 bit PCM." + phones = sentences[utt_id][0] + durations = sentences[utt_id][1] + speaker = sentences[utt_id][2] + d_cumsum = np.pad(np.array(durations).cumsum(0), (1, 0), 'constant') + + # little imprecise than use *.TextGrid directly + times = librosa.frames_to_time( + d_cumsum, sr=config.fs, hop_length=config.n_shift) + if cut_sil: + start = 0 + end = d_cumsum[-1] + if phones[0] == "sil" and len(durations) > 1: + start = times[1] + durations = durations[1:] + phones = phones[1:] + if phones[-1] == 'sil' and len(durations) > 1: + end = times[-2] + durations = durations[:-1] + phones = phones[:-1] + sentences[utt_id][0] = phones + sentences[utt_id][1] = durations + start, end = librosa.time_to_samples([start, end], sr=config.fs) + wav = wav[start:end] + + # extract mel feats + logmel = mel_extractor.get_log_mel_fbank(wav) + # change duration according to mel_length + compare_duration_and_mel_length(sentences, utt_id, logmel) + # utt_id may be popped in compare_duration_and_mel_length + if utt_id not in sentences: + return None + phones = sentences[utt_id][0] + durations = sentences[utt_id][1] + num_frames = logmel.shape[0] + assert sum(durations) == num_frames + + new_d_cumsum = np.pad(np.array(durations).cumsum(0), (1, 0), 'constant') + align_start = new_d_cumsum[:-1] + align_end = new_d_cumsum[1:] + assert len(align_start) == len(align_end) == len(durations) + + mel_dir = output_dir / "data_speech" + mel_dir.mkdir(parents=True, exist_ok=True) + mel_path = mel_dir / (utt_id + "_speech.npy") + np.save(mel_path, logmel) + # align_start_lengths == text_lengths + record = { + "utt_id": utt_id, + "phones": phones, + "text_lengths": len(phones), + "speech_lengths": num_frames, + "durations": durations, + "speech": str(mel_path), + "speaker": speaker, + "align_start": align_start.tolist(), + "align_end": align_end.tolist(), + } + if spk_emb_dir: + if speaker in os.listdir(spk_emb_dir): + embed_name = utt_id + ".npy" + embed_path = spk_emb_dir / speaker / embed_name 
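+                # skip utterances whose speaker embedding file is missing, so
+                # the dumped metadata only references embeddings that exist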
+ if embed_path.is_file(): + record["spk_emb"] = str(embed_path) + else: + return None + return record + + +def process_sentences(config, + fps: List[Path], + sentences: Dict, + output_dir: Path, + mel_extractor=None, + nprocs: int=1, + cut_sil: bool=True, + spk_emb_dir: Path=None): + if nprocs == 1: + results = [] + for fp in tqdm.tqdm(fps, total=len(fps)): + record = process_sentence( + config=config, + fp=fp, + sentences=sentences, + output_dir=output_dir, + mel_extractor=mel_extractor, + cut_sil=cut_sil, + spk_emb_dir=spk_emb_dir) + if record: + results.append(record) + else: + with ThreadPoolExecutor(nprocs) as pool: + futures = [] + with tqdm.tqdm(total=len(fps)) as progress: + for fp in fps: + future = pool.submit(process_sentence, config, fp, + sentences, output_dir, mel_extractor, + cut_sil, spk_emb_dir) + future.add_done_callback(lambda p: progress.update()) + futures.append(future) + + results = [] + for ft in futures: + record = ft.result() + if record: + results.append(record) + + results.sort(key=itemgetter("utt_id")) + # replace 'w' with 'a' to write from the end of file + with jsonlines.open(output_dir / "metadata.jsonl", 'a') as writer: + for item in results: + writer.write(item) + print("Done") + + +def main(): + # parse config and args + parser = argparse.ArgumentParser( + description="Preprocess audio and then extract features.") + + parser.add_argument( + "--dataset", + default="baker", + type=str, + help="name of dataset, should in {baker, aishell3, ljspeech, vctk} now") + + parser.add_argument( + "--rootdir", default=None, type=str, help="directory to dataset.") + + parser.add_argument( + "--dumpdir", + type=str, + required=True, + help="directory to dump feature files.") + parser.add_argument( + "--dur-file", default=None, type=str, help="path to durations.txt.") + + parser.add_argument("--config", type=str, help="fastspeech2 config file.") + + parser.add_argument( + "--num-cpu", type=int, default=1, help="number of process.") + + parser.add_argument( + "--cut-sil", + type=str2bool, + default=True, + help="whether cut sil in the edge of audio") + + parser.add_argument( + "--spk_emb_dir", + default=None, + type=str, + help="directory to speaker embedding files.") + args = parser.parse_args() + + rootdir = Path(args.rootdir).expanduser() + dumpdir = Path(args.dumpdir).expanduser() + # use absolute path + dumpdir = dumpdir.resolve() + dumpdir.mkdir(parents=True, exist_ok=True) + dur_file = Path(args.dur_file).expanduser() + + if args.spk_emb_dir: + spk_emb_dir = Path(args.spk_emb_dir).expanduser().resolve() + else: + spk_emb_dir = None + + assert rootdir.is_dir() + assert dur_file.is_file() + + with open(args.config, 'rt') as f: + config = CfgNode(yaml.safe_load(f)) + + sentences, speaker_set = get_phn_dur(dur_file) + + merge_silence(sentences) + phone_id_map_path = dumpdir / "phone_id_map.txt" + speaker_id_map_path = dumpdir / "speaker_id_map.txt" + get_input_token(sentences, phone_id_map_path, args.dataset) + get_spk_id_map(speaker_set, speaker_id_map_path) + + if args.dataset == "baker": + wav_files = sorted(list((rootdir / "Wave").rglob("*.wav"))) + # split data into 3 sections + num_train = 9800 + num_dev = 100 + train_wav_files = wav_files[:num_train] + dev_wav_files = wav_files[num_train:num_train + num_dev] + test_wav_files = wav_files[num_train + num_dev:] + elif args.dataset == "aishell3": + sub_num_dev = 5 + wav_dir = rootdir / "train" / "wav" + train_wav_files = [] + dev_wav_files = [] + test_wav_files = [] + for speaker in os.listdir(wav_dir): + wav_files 
= sorted(list((wav_dir / speaker).rglob("*.wav"))) + if len(wav_files) > 100: + train_wav_files += wav_files[:-sub_num_dev * 2] + dev_wav_files += wav_files[-sub_num_dev * 2:-sub_num_dev] + test_wav_files += wav_files[-sub_num_dev:] + else: + train_wav_files += wav_files + + elif args.dataset == "ljspeech": + wav_files = sorted(list((rootdir / "wavs").rglob("*.wav"))) + # split data into 3 sections + num_train = 12900 + num_dev = 100 + train_wav_files = wav_files[:num_train] + dev_wav_files = wav_files[num_train:num_train + num_dev] + test_wav_files = wav_files[num_train + num_dev:] + elif args.dataset == "vctk": + sub_num_dev = 5 + wav_dir = rootdir / "wav48_silence_trimmed" + train_wav_files = [] + dev_wav_files = [] + test_wav_files = [] + for speaker in os.listdir(wav_dir): + wav_files = sorted(list((wav_dir / speaker).rglob("*_mic2.flac"))) + if len(wav_files) > 100: + train_wav_files += wav_files[:-sub_num_dev * 2] + dev_wav_files += wav_files[-sub_num_dev * 2:-sub_num_dev] + test_wav_files += wav_files[-sub_num_dev:] + else: + train_wav_files += wav_files + + else: + print("dataset should in {baker, aishell3, ljspeech, vctk} now!") + + train_dump_dir = dumpdir / "train" / "raw" + train_dump_dir.mkdir(parents=True, exist_ok=True) + dev_dump_dir = dumpdir / "dev" / "raw" + dev_dump_dir.mkdir(parents=True, exist_ok=True) + test_dump_dir = dumpdir / "test" / "raw" + test_dump_dir.mkdir(parents=True, exist_ok=True) + + # Extractor + mel_extractor = LogMelFBank( + sr=config.fs, + n_fft=config.n_fft, + hop_length=config.n_shift, + win_length=config.win_length, + window=config.window, + n_mels=config.n_mels, + fmin=config.fmin, + fmax=config.fmax) + + # process for the 3 sections + if train_wav_files: + process_sentences( + config=config, + fps=train_wav_files, + sentences=sentences, + output_dir=train_dump_dir, + mel_extractor=mel_extractor, + nprocs=args.num_cpu, + cut_sil=args.cut_sil, + spk_emb_dir=spk_emb_dir) + if dev_wav_files: + process_sentences( + config=config, + fps=dev_wav_files, + sentences=sentences, + output_dir=dev_dump_dir, + mel_extractor=mel_extractor, + cut_sil=args.cut_sil, + spk_emb_dir=spk_emb_dir) + if test_wav_files: + process_sentences( + config=config, + fps=test_wav_files, + sentences=sentences, + output_dir=test_dump_dir, + mel_extractor=mel_extractor, + nprocs=args.num_cpu, + cut_sil=args.cut_sil, + spk_emb_dir=spk_emb_dir) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/synthesize.py b/paddlespeech/t2s/exps/ernie_sat/synthesize.py new file mode 100644 index 0000000000000000000000000000000000000000..2e358294889734c9920a754250ac2946506be62e --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/synthesize.py @@ -0,0 +1,200 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
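+# Offline evaluation: for each utterance in test_metadata, mask the middle
+# third of the mel spectrogram, let ERNIE-SAT regenerate it, then run the
+# vocoder on both the generated mel and the denormalized reference mel.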
+import argparse +import logging +from pathlib import Path + +import jsonlines +import numpy as np +import paddle +import soundfile as sf +import yaml +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn +from paddlespeech.t2s.exps.syn_utils import denorm +from paddlespeech.t2s.exps.syn_utils import get_am_inference +from paddlespeech.t2s.exps.syn_utils import get_test_dataset +from paddlespeech.t2s.exps.syn_utils import get_voc_inference + + +def evaluate(args): + # dataloader has been too verbose + logging.getLogger("DataLoader").disabled = True + + # construct dataset for evaluation + with jsonlines.open(args.test_metadata, 'r') as reader: + test_metadata = list(reader) + + # Init body. + with open(args.erniesat_config) as f: + erniesat_config = CfgNode(yaml.safe_load(f)) + with open(args.voc_config) as f: + voc_config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(erniesat_config) + print(voc_config) + + # ernie sat model + erniesat_inference = get_am_inference( + am='erniesat_dataset', + am_config=erniesat_config, + am_ckpt=args.erniesat_ckpt, + am_stat=args.erniesat_stat, + phones_dict=args.phones_dict) + + test_dataset = get_test_dataset( + test_metadata=test_metadata, am='erniesat_dataset') + + # vocoder + voc_inference = get_voc_inference( + voc=args.voc, + voc_config=voc_config, + voc_ckpt=args.voc_ckpt, + voc_stat=args.voc_stat) + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + collate_fn = build_erniesat_collate_fn( + mlm_prob=erniesat_config.mlm_prob, + mean_phn_span=erniesat_config.mean_phn_span, + seg_emb=erniesat_config.model['enc_input_layer'] == 'sega_mlm', + text_masking=False) + + gen_raw = True + erniesat_mu, erniesat_std = np.load(args.erniesat_stat) + + for datum in test_dataset: + # collate function and dataloader + utt_id = datum["utt_id"] + speech_len = datum["speech_lengths"] + + # mask the middle 1/3 speech + left_bdy, right_bdy = speech_len // 3, 2 * speech_len // 3 + span_bdy = [left_bdy, right_bdy] + datum.update({"span_bdy": span_bdy}) + + batch = collate_fn([datum]) + with paddle.no_grad(): + out_mels = erniesat_inference( + speech=batch["speech"], + text=batch["text"], + masked_pos=batch["masked_pos"], + speech_mask=batch["speech_mask"], + text_mask=batch["text_mask"], + speech_seg_pos=batch["speech_seg_pos"], + text_seg_pos=batch["text_seg_pos"], + span_bdy=span_bdy) + + # vocoder + wav_list = [] + for mel in out_mels: + part_wav = voc_inference(mel) + wav_list.append(part_wav) + wav = paddle.concat(wav_list) + wav = wav.numpy() + if gen_raw: + speech = datum['speech'] + denorm_mel = denorm(speech, erniesat_mu, erniesat_std) + denorm_mel = paddle.to_tensor(denorm_mel) + wav_raw = voc_inference(denorm_mel) + wav_raw = wav_raw.numpy() + + sf.write( + str(output_dir / (utt_id + ".wav")), + wav, + samplerate=erniesat_config.fs) + if gen_raw: + sf.write( + str(output_dir / (utt_id + "_raw" + ".wav")), + wav_raw, + samplerate=erniesat_config.fs) + + print(f"{utt_id} done!") + + +def parse_args(): + # parse args and config + parser = argparse.ArgumentParser( + description="Synthesize with acoustic model & vocoder") + # ernie sat + + parser.add_argument( + '--erniesat_config', + type=str, + default=None, + help='Config of acoustic model.') + parser.add_argument( + '--erniesat_ckpt', + type=str, + default=None, + help='Checkpoint file of acoustic model.') + 
parser.add_argument( + "--erniesat_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training acoustic model." + ) + parser.add_argument( + "--phones_dict", type=str, default=None, help="phone vocabulary file.") + # vocoder + parser.add_argument( + '--voc', + type=str, + default='pwgan_csmsc', + choices=[ + 'pwgan_aishell3', + 'pwgan_vctk', + 'hifigan_aishell3', + 'hifigan_vctk', + ], + help='Choose vocoder type of tts task.') + parser.add_argument( + '--voc_config', type=str, default=None, help='Config of voc.') + parser.add_argument( + '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.') + parser.add_argument( + "--voc_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training voc." + ) + # other + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") + parser.add_argument("--test_metadata", type=str, help="test metadata.") + parser.add_argument("--output_dir", type=str, help="output dir.") + + args = parser.parse_args() + return args + + +def main(): + + args = parse_args() + if args.ngpu == 0: + paddle.set_device("cpu") + elif args.ngpu > 0: + paddle.set_device("gpu") + else: + print("ngpu should >= 0 !") + + evaluate(args) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/synthesize_e2e.py b/paddlespeech/t2s/exps/ernie_sat/synthesize_e2e.py new file mode 100644 index 0000000000000000000000000000000000000000..95b07367ce36f20147d95c22e41c51ca8a262f3d --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/synthesize_e2e.py @@ -0,0 +1,346 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
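+# required by the code below but missing from this file: os.path.basename and
+# Path for the temp dir, List in type hints, paddle.concat / nn.Layer in
+# decode_with_model, and yaml / CfgNode for loading the config in __main__
+import os
+from pathlib import Path
+from typing import List
+
+import paddle
+import yaml
+from paddle import nn
+from yacs.config import CfgNode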
+import librosa +import numpy as np +import soundfile as sf + +from paddlespeech.t2s.exps.ernie_sat.align import get_phns_spans +from paddlespeech.t2s.exps.ernie_sat.utils import eval_durs +from paddlespeech.t2s.exps.ernie_sat.utils import get_dur_adj_factor +from paddlespeech.t2s.exps.ernie_sat.utils import get_span_bdy +from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn +from paddlespeech.t2s.exps.syn_utils import get_frontend +from paddlespeech.t2s.datasets.get_feats import LogMelFBank +from paddlespeech.t2s.exps.syn_utils import norm +from paddlespeech.t2s.exps.ernie_sat.utils import get_tmp_name + + + + + + +def _p2id(self, phonemes: List[str]) -> np.ndarray: + # replace unk phone with sp + phonemes = [ + phn if phn in vocab_phones else "sp" for phn in phonemes + ] + phone_ids = [vocab_phones[item] for item in phonemes] + return np.array(phone_ids, np.int64) + + + +def prep_feats_with_dur(wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + duration_adjust: bool=True, + fs: int=24000, + n_shift: int=300): + ''' + Returns: + np.ndarray: new wav, replace the part to be edited in original wav with 0 + List[str]: new phones + List[float]: mfa start of new wav + List[float]: mfa end of new wav + List[int]: masked mel boundary of original wav + List[int]: masked mel boundary of new wav + ''' + wav_org, _ = librosa.load(wav_path, sr=fs) + phns_spans_outs = get_phns_spans( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + source_lang=source_lang, + target_lang=target_lang, + fs=fs, + n_shift=n_shift) + + mfa_start = phns_spans_outs["mfa_start"] + mfa_end = phns_spans_outs["mfa_end"] + old_phns = phns_spans_outs["old_phns"] + new_phns = phns_spans_outs["new_phns"] + span_to_repl = phns_spans_outs["span_to_repl"] + span_to_add = phns_spans_outs["span_to_add"] + + # 中文的 phns 不一定都在 fastspeech2 的字典里, 用 sp 代替 + if target_lang in {'en', 'zh'}: + old_durs = eval_durs(old_phns, target_lang=source_lang) + else: + assert target_lang in {'en', 'zh'}, \ + "calculate duration_predict is not support for this language..." + + orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)] + + if duration_adjust: + d_factor = get_dur_adj_factor( + orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns) + d_factor = d_factor * 1.25 + else: + d_factor = 1 + + if target_lang in {'en', 'zh'}: + new_durs = eval_durs(new_phns, target_lang=target_lang) + else: + assert target_lang == "zh" or target_lang == "en", \ + "calculate duration_predict is not support for this language..." + + # duration 要是整数 + new_durs_adjusted = [int(np.ceil(d_factor * i)) for i in new_durs] + + new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]]) + old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]]) + dur_offset = new_span_dur_sum - old_span_dur_sum + new_mfa_start = mfa_start[:span_to_repl[0]] + new_mfa_end = mfa_end[:span_to_repl[0]] + + for dur in new_durs_adjusted[span_to_add[0]:span_to_add[1]]: + if len(new_mfa_end) == 0: + new_mfa_start.append(0) + new_mfa_end.append(dur) + else: + new_mfa_start.append(new_mfa_end[-1]) + new_mfa_end.append(new_mfa_end[-1] + dur) + + new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]] + new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]] + + # 3. 
get new wav + # 在原始句子后拼接 + if span_to_repl[0] >= len(mfa_start): + wav_left_idx = len(wav_org) + wav_right_idx = wav_left_idx + # 在原始句子中间替换 + else: + wav_left_idx = int(np.floor(mfa_start[span_to_repl[0]] * n_shift)) + wav_right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * n_shift)) + blank_wav = np.zeros( + (int(np.ceil(new_span_dur_sum * n_shift)), ), dtype=wav_org.dtype) + # 原始音频,需要编辑的部分替换成空音频,空音频的时间由 fs2 的 duration_predictor 决定 + new_wav = np.concatenate( + [wav_org[:wav_left_idx], blank_wav, wav_org[wav_right_idx:]]) + + # 音频是正常遮住了 + sf.write(str("new_wav.wav"), new_wav, samplerate=fs) + + # 4. get old and new mel span to be mask + old_span_bdy = get_span_bdy( + mfa_start=mfa_start, mfa_end=mfa_end, span_to_repl=span_to_repl) + + new_span_bdy = get_span_bdy( + mfa_start=new_mfa_start, mfa_end=new_mfa_end, span_to_repl=span_to_add) + + # old_span_bdy, new_span_bdy 是帧级别的范围 + outs = {} + outs['new_wav'] = new_wav + outs['new_phns'] = new_phns + outs['new_mfa_start'] = new_mfa_start + outs['new_mfa_end'] = new_mfa_end + outs['old_span_bdy'] = old_span_bdy + outs['new_span_bdy'] = new_span_bdy + return outs + + + + +def prep_feats(wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + duration_adjust: bool=True, + fs: int=24000, + n_shift: int=300): + + outs = prep_feats_with_dur( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + source_lang=source_lang, + target_lang=target_lang, + duration_adjust=duration_adjust, + fs=fs, + n_shift=n_shift) + + wav_name = os.path.basename(wav_path) + utt_id = wav_name.split('.')[0] + + wav = outs['new_wav'] + phns = outs['new_phns'] + mfa_start = outs['new_mfa_start'] + mfa_end = outs['new_mfa_end'] + old_span_bdy = outs['old_span_bdy'] + new_span_bdy = outs['new_span_bdy'] + span_bdy = np.array(new_span_bdy) + + text = _p2id(phns) + mel = mel_extractor.get_log_mel_fbank(wav) + erniesat_mean, erniesat_std = np.load(erniesat_stat) + normed_mel = norm(mel, erniesat_mean, erniesat_std) + tmp_name = get_tmp_name(text=old_str) + tmpbase = './tmp_dir/' + tmp_name + tmpbase = Path(tmpbase) + tmpbase.mkdir(parents=True, exist_ok=True) + print("tmp_name in synthesize_e2e:",tmp_name) + + mel_path = tmpbase / 'mel.npy' + print("mel_path:",mel_path) + np.save(mel_path, logmel) + durations = [e - s for e, s in zip(mfa_end, mfa_start)] + + datum={ + "utt_id": utt_id, + "spk_id": 0, + "text": text, + "text_lengths": len(text), + "speech_lengths": 115, + "durations": durations, + "speech": mel_path, + "align_start": mfa_start, + "align_end": mfa_end, + "span_bdy": span_bdy + } + + batch = collate_fn([datum]) + print("batch:",batch) + + return batch, old_span_bdy, new_span_bdy + + +def decode_with_model(mlm_model: nn.Layer, + collate_fn, + wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + fs: int=24000, + n_shift: int=300, + token_list: List[str]=[]): + batch, old_span_bdy, new_span_bdy = prep_feats( + source_lang=source_lang, + target_lang=target_lang, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + duration_adjust=duration_adjust, + fs=fs, + n_shift=n_shift, + token_list=token_list) + + + + feats = collate_fn(batch)[1] + + if 'text_masked_pos' in feats.keys(): + feats.pop('text_masked_pos') + + output = mlm_model.inference( + text=feats['text'], + speech=feats['speech'], + masked_pos=feats['masked_pos'], + speech_mask=feats['speech_mask'], + text_mask=feats['text_mask'], + 
speech_seg_pos=feats['speech_seg_pos'], + text_seg_pos=feats['text_seg_pos'], + span_bdy=new_span_bdy, + use_teacher_forcing=use_teacher_forcing) + + # 拼接音频 + output_feat = paddle.concat(x=output, axis=0) + wav_org, _ = librosa.load(wav_path, sr=fs) + return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length + + +if __name__ == '__main__': + fs = 24000 + n_shift = 300 + wav_path = "exp/p243_313.wav" + old_str = "For that reason cover should not be given." + # for edit + # new_str = "for that reason cover is impossible to be given." + # for synthesize + append_str = "do you love me i love you so much" + new_str = old_str + append_str + + ''' + outs = prep_feats_with_dur( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + fs=fs, + n_shift=n_shift) + + new_wav = outs['new_wav'] + new_phns = outs['new_phns'] + new_mfa_start = outs['new_mfa_start'] + new_mfa_end = outs['new_mfa_end'] + old_span_bdy = outs['old_span_bdy'] + new_span_bdy = outs['new_span_bdy'] + + print("---------------------------------") + + print("new_wav:", new_wav) + print("new_phns:", new_phns) + print("new_mfa_start:", new_mfa_start) + print("new_mfa_end:", new_mfa_end) + print("old_span_bdy:", old_span_bdy) + print("new_span_bdy:", new_span_bdy) + print("---------------------------------") + ''' + + erniesat_config = "/home/yuantian01/PaddleSpeech_ERNIE_SAT/PaddleSpeech/examples/vctk/ernie_sat/local/default.yaml" + + with open(erniesat_config) as f: + erniesat_config = CfgNode(yaml.safe_load(f)) + + erniesat_stat = "/home/yuantian01/PaddleSpeech_ERNIE_SAT/PaddleSpeech/examples/vctk/ernie_sat/dump/train/speech_stats.npy" + + # Extractor + mel_extractor = LogMelFBank( + sr=erniesat_config.fs, + n_fft=erniesat_config.n_fft, + hop_length=erniesat_config.n_shift, + win_length=erniesat_config.win_length, + window=erniesat_config.window, + n_mels=erniesat_config.n_mels, + fmin=erniesat_config.fmin, + fmax=erniesat_config.fmax) + + + + collate_fn = build_erniesat_collate_fn( + mlm_prob=erniesat_config.mlm_prob, + mean_phn_span=erniesat_config.mean_phn_span, + seg_emb=erniesat_config.model['enc_input_layer'] == 'sega_mlm', + text_masking=False) + + phones_dict='/home/yuantian01/PaddleSpeech_ERNIE_SAT/PaddleSpeech/examples/vctk/ernie_sat/dump/phone_id_map.txt' + vocab_phones = {} + + with open(phones_dict, 'rt') as f: + phn_id = [line.strip().split() for line in f.readlines()] + for phn, id in phn_id: + vocab_phones[phn] = int(id) + + prep_feats(wav_path=wav_path, + old_str=old_str, + new_str=new_str, + fs=fs, + n_shift=n_shift) + + + diff --git a/paddlespeech/t2s/exps/ernie_sat/train.py b/paddlespeech/t2s/exps/ernie_sat/train.py new file mode 100644 index 0000000000000000000000000000000000000000..ccd1245e1da504866f5aa59c3e4ba4cc591eb2b2 --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/train.py @@ -0,0 +1,203 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
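+# Training entry point: wraps ErnieSAT in the standard PaddleSpeech
+# Trainer / Updater loop, with a Noam LR schedule, global-norm gradient
+# clipping, and the same masking collate_fn that is used at synthesis time.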
+import argparse +import logging +import os +import shutil +from pathlib import Path + +import jsonlines +import numpy as np +import paddle +import yaml +from paddle import DataParallel +from paddle import distributed as dist +from paddle import nn +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.optimizer import Adam +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn +from paddlespeech.t2s.datasets.data_table import DataTable +from paddlespeech.t2s.models.ernie_sat import ErnieSAT +from paddlespeech.t2s.models.ernie_sat import ErnieSATEvaluator +from paddlespeech.t2s.models.ernie_sat import ErnieSATUpdater +from paddlespeech.t2s.training.extensions.snapshot import Snapshot +from paddlespeech.t2s.training.extensions.visualizer import VisualDL +from paddlespeech.t2s.training.seeding import seed_everything +from paddlespeech.t2s.training.trainer import Trainer + + +def train_sp(args, config): + # decides device type and whether to run in parallel + # setup running environment correctly + if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0: + paddle.set_device("cpu") + else: + paddle.set_device("gpu") + world_size = paddle.distributed.get_world_size() + if world_size > 1: + paddle.distributed.init_parallel_env() + + # set the random seed, it is a must for multiprocess training + seed_everything(config.seed) + + print( + f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}", + ) + fields = [ + "text", "text_lengths", "speech", "speech_lengths", "align_start", + "align_end" + ] + converters = {"speech": np.load} + # dataloader has been too verbose + logging.getLogger("DataLoader").disabled = True + + # construct dataset for training and validation + with jsonlines.open(args.train_metadata, 'r') as reader: + train_metadata = list(reader) + train_dataset = DataTable( + data=train_metadata, + fields=fields, + converters=converters, ) + with jsonlines.open(args.dev_metadata, 'r') as reader: + dev_metadata = list(reader) + dev_dataset = DataTable( + data=dev_metadata, + fields=fields, + converters=converters, ) + + # collate function and dataloader + collate_fn = build_erniesat_collate_fn( + mlm_prob=config.mlm_prob, + mean_phn_span=config.mean_phn_span, + seg_emb=config.model['enc_input_layer'] == 'sega_mlm', + text_masking=config["model"]["text_masking"]) + + train_sampler = DistributedBatchSampler( + train_dataset, + batch_size=config.batch_size, + shuffle=True, + drop_last=True) + + print("samplers done!") + + train_dataloader = DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=collate_fn, + num_workers=config.num_workers) + + dev_dataloader = DataLoader( + dev_dataset, + shuffle=False, + drop_last=False, + batch_size=config.batch_size, + collate_fn=collate_fn, + num_workers=config.num_workers) + print("dataloaders done!") + + with open(args.phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + vocab_size = len(phn_id) + print("vocab_size:", vocab_size) + + odim = config.n_mels + model = ErnieSAT(idim=vocab_size, odim=odim, **config["model"]) + + if world_size > 1: + model = DataParallel(model) + print("model done!") + + scheduler = paddle.optimizer.lr.NoamDecay( + d_model=config["scheduler_params"]["d_model"], + warmup_steps=config["scheduler_params"]["warmup_steps"]) + grad_clip = nn.ClipGradByGlobalNorm(config["grad_clip"]) + optimizer = Adam( + learning_rate=scheduler, + grad_clip=grad_clip, + 
parameters=model.parameters()) + + print("optimizer done!") + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + if dist.get_rank() == 0: + config_name = args.config.split("/")[-1] + # copy conf to output_dir + shutil.copyfile(args.config, output_dir / config_name) + + updater = ErnieSATUpdater( + model=model, + optimizer=optimizer, + scheduler=scheduler, + dataloader=train_dataloader, + text_masking=config["model"]["text_masking"], + odim=odim, + vocab_size=vocab_size, + output_dir=output_dir) + + trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir) + + evaluator = ErnieSATEvaluator( + model=model, + dataloader=dev_dataloader, + text_masking=config["model"]["text_masking"], + odim=odim, + vocab_size=vocab_size, + output_dir=output_dir, ) + + if dist.get_rank() == 0: + trainer.extend(evaluator, trigger=(1, "epoch")) + trainer.extend(VisualDL(output_dir), trigger=(1, "iteration")) + trainer.extend( + Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch')) + trainer.run() + + +def main(): + # parse args and config and redirect to train_sp + parser = argparse.ArgumentParser(description="Train an ErnieSAT model.") + parser.add_argument("--config", type=str, help="ErnieSAT config file.") + parser.add_argument("--train-metadata", type=str, help="training data.") + parser.add_argument("--dev-metadata", type=str, help="dev data.") + parser.add_argument("--output-dir", type=str, help="output dir.") + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu=0, use cpu.") + parser.add_argument( + "--phones-dict", type=str, default=None, help="phone vocabulary file.") + + args = parser.parse_args() + + with open(args.config) as f: + config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(config) + print( + f"master see the word size: {dist.get_world_size()}, from pid: {os.getpid()}" + ) + + # dispatch + if args.ngpu > 1: + dist.spawn(train_sp, (args, config), nprocs=args.ngpu) + else: + train_sp(args, config) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/utils.py b/paddlespeech/t2s/exps/ernie_sat/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..9169efa360d84f243bbe2e9600af299a60de2a4b --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/utils.py @@ -0,0 +1,216 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
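+# Illustrative example of get_tmp_name (user name and pid are hypothetical):
+#   get_tmp_name("hello") for user "alice" with pid 4242 returns
+#   "alice_4242_5d41402abc4b2a76b9719d911017c592"
+# i.e. a per-user, per-process, content-hashed MFA temp-directory name.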
+from pathlib import Path +from typing import Dict +from typing import List +from typing import Union +import os + +import numpy as np +import paddle +import yaml +from yacs.config import CfgNode +import hashlib + + +from paddlespeech.t2s.exps.syn_utils import get_am_inference +from paddlespeech.t2s.exps.syn_utils import get_voc_inference + +def _get_user(): + return os.path.expanduser('~').split('/')[-1] + +def str2md5(string): + md5_val = hashlib.md5(string.encode('utf8')).hexdigest() + return md5_val + +def get_tmp_name(text:str): + return _get_user() + '_' + str(os.getpid()) + '_' + str2md5(text) + +def get_dict(dictfile: str): + word2phns_dict = {} + with open(dictfile, 'r') as fid: + for line in fid: + line_lst = line.split() + word, phn_lst = line_lst[0], line.split()[1:] + if word not in word2phns_dict.keys(): + word2phns_dict[word] = ' '.join(phn_lst) + return word2phns_dict + + +# 获取需要被 mask 的 mel 帧的范围 +def get_span_bdy(mfa_start: List[float], + mfa_end: List[float], + span_to_repl: List[List[int]]): + if span_to_repl[0] >= len(mfa_start): + span_bdy = [mfa_end[-1], mfa_end[-1]] + else: + span_bdy = [mfa_start[span_to_repl[0]], mfa_end[span_to_repl[1] - 1]] + return span_bdy + + +# mfa 获得的 duration 和 fs2 的 duration_predictor 获取的 duration 可能不同 +# 此处获得一个缩放比例, 用于预测值和真实值之间的缩放 +def get_dur_adj_factor(orig_dur: List[int], + pred_dur: List[int], + phns: List[str]): + length = 0 + factor_list = [] + for orig, pred, phn in zip(orig_dur, pred_dur, phns): + if pred == 0 or phn == 'sp': + continue + else: + factor_list.append(orig / pred) + factor_list = np.array(factor_list) + factor_list.sort() + if len(factor_list) < 5: + return 1 + length = 2 + avg = np.average(factor_list[length:-length]) + return avg + + +def read_2col_text(path: Union[Path, str]) -> Dict[str, str]: + """Read a text file having 2 column as dict object. 
+
+    Examples:
+        wav.scp:
+            key1 /some/path/a.wav
+            key2 /some/path/b.wav
+
+        >>> read_2col_text('wav.scp')
+        {'key1': '/some/path/a.wav', 'key2': '/some/path/b.wav'}
+
+    """
+
+    data = {}
+    with Path(path).open("r", encoding="utf-8") as f:
+        for linenum, line in enumerate(f, 1):
+            sps = line.rstrip().split(maxsplit=1)
+            if len(sps) == 1:
+                k, v = sps[0], ""
+            else:
+                k, v = sps
+            if k in data:
+                raise RuntimeError(f"{k} is duplicated ({path}:{linenum})")
+            data[k] = v
+    return data
+
+
+def load_num_sequence_text(path: Union[Path, str], loader_type: str="csv_int"
+                           ) -> Dict[str, List[Union[float, int]]]:
+    """Read a text file indicating sequences of number
+
+    Examples:
+        key1 1 2 3
+        key2 34 5 6
+
+        >>> d = load_num_sequence_text('text')
+        >>> np.testing.assert_array_equal(d["key1"], np.array([1, 2, 3]))
+    """
+    if loader_type == "text_int":
+        delimiter = " "
+        dtype = int
+    elif loader_type == "text_float":
+        delimiter = " "
+        dtype = float
+    elif loader_type == "csv_int":
+        delimiter = ","
+        dtype = int
+    elif loader_type == "csv_float":
+        delimiter = ","
+        dtype = float
+    else:
+        raise ValueError(f"Not supported loader_type={loader_type}")
+
+    # path looks like:
+    #   utta 1,0
+    #   uttb 3,4,5
+    # -> return {'utta': np.ndarray([1, 0]),
+    #            'uttb': np.ndarray([3, 4, 5])}
+    d = read_2col_text(path)
+    # Using for-loop instead of dict-comprehension for debuggability
+    retval = {}
+    for k, v in d.items():
+        try:
+            retval[k] = [dtype(i) for i in v.split(delimiter)]
+        except TypeError:
+            print(f'Error happened with path="{path}", id="{k}", value="{v}"')
+            raise
+    return retval
+
+
+def is_chinese(ch):
+    if u'\u4e00' <= ch <= u'\u9fff':
+        return True
+    else:
+        return False
+
+
+def get_voc_out(mel):
+    # vocoder
+    # NOTE: relies on a parse_args() that provides --voc, --voc_config,
+    # --voc_ckpt and --voc_stat; no such parser is defined in this module
+    args = parse_args()
+    with open(args.voc_config) as f:
+        voc_config = CfgNode(yaml.safe_load(f))
+    voc_inference = get_voc_inference(
+        voc=args.voc,
+        voc_config=voc_config,
+        voc_ckpt=args.voc_ckpt,
+        voc_stat=args.voc_stat)
+
+    with paddle.no_grad():
+        wav = voc_inference(mel)
+    return np.squeeze(wav)
+
+
+def eval_durs(phns, target_lang: str='zh', fs: int=24000, n_shift: int=300):
+
+    if target_lang == 'en':
+        am = "fastspeech2_ljspeech"
+        am_config = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml"
+        am_ckpt = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz"
+        am_stat = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy"
+        phones_dict = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt"
+
+    elif target_lang == 'zh':
+        am = "fastspeech2_csmsc"
+        am_config = "download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml"
+        am_ckpt = "download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz"
+        am_stat = "download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy"
+        phones_dict = "download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt"
+
+    # Init body.
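+    # Load the FastSpeech2 acoustic model and query only its duration
+    # predictor: d_outs below is the predicted frame count for each phone.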
+ with open(am_config) as f: + am_config = CfgNode(yaml.safe_load(f)) + + am_inference, am = get_am_inference( + am=am, + am_config=am_config, + am_ckpt=am_ckpt, + am_stat=am_stat, + phones_dict=phones_dict, + return_am=True) + + vocab_phones = {} + with open(phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + for tone, id in phn_id: + vocab_phones[tone] = int(id) + vocab_size = len(vocab_phones) + phonemes = [phn if phn in vocab_phones else "sp" for phn in phns] + + phone_ids = [vocab_phones[item] for item in phonemes] + phone_ids = paddle.to_tensor(np.array(phone_ids, np.int64)) + _, d_outs, _, _ = am.inference(phone_ids) + d_outs = d_outs.tolist() + return d_outs diff --git a/paddlespeech/t2s/exps/sentences_mix.txt b/paddlespeech/t2s/exps/sentences_mix.txt new file mode 100644 index 0000000000000000000000000000000000000000..06e97d14a82070f9e80f28a7e1d0468c8afce88c --- /dev/null +++ b/paddlespeech/t2s/exps/sentences_mix.txt @@ -0,0 +1,8 @@ +001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧! +002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN. +003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。 +004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。 +005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。 +006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中! +007 我喜欢 eat apple, 你喜欢 drink milk。 +008 我们要去云南 team building, 非常非常 happy. \ No newline at end of file diff --git a/paddlespeech/t2s/exps/syn_utils.py b/paddlespeech/t2s/exps/syn_utils.py index 77abf97d9227f0a5df042a1a831d6cdc902bc499..bade62acac3c926402f4c3e488b9ad3e88e04d37 100644 --- a/paddlespeech/t2s/exps/syn_utils.py +++ b/paddlespeech/t2s/exps/syn_utils.py @@ -69,6 +69,10 @@ model_alias = { "paddlespeech.t2s.models.wavernn:WaveRNN", "wavernn_inference": "paddlespeech.t2s.models.wavernn:WaveRNNInference", + "erniesat": + "paddlespeech.t2s.models.ernie_sat:ErnieSAT", + "erniesat_inference": + "paddlespeech.t2s.models.ernie_sat:ErnieSATInference", } @@ -112,6 +116,7 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]], # model: {model_name}_{dataset} am_name = am[:am.rindex('_')] am_dataset = am[am.rindex('_') + 1:] + converters = {} if am_name == 'fastspeech2': fields = ["utt_id", "text"] if am_dataset in {"aishell3", "vctk", @@ -130,8 +135,17 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]], if voice_cloning: print("voice cloning!") fields += ["spk_emb"] + elif am_name == 'erniesat': + fields = [ + "utt_id", "text", "text_lengths", "speech", "speech_lengths", + "align_start", "align_end" + ] + converters = {"speech": np.load} + else: + print("wrong am, please input right am!!!") - test_dataset = DataTable(data=test_metadata, fields=fields) + test_dataset = DataTable( + data=test_metadata, fields=fields, converters=converters) return test_dataset @@ -201,6 +215,10 @@ def get_am_inference(am: str='fastspeech2_csmsc', **am_config["model"]) elif am_name == 'tacotron2': am = am_class(idim=vocab_size, odim=odim, **am_config["model"]) + elif am_name == 'erniesat': + am = am_class(idim=vocab_size, odim=odim, **am_config["model"]) + else: + print("wrong am, please input right am!!!") am.set_state_dict(paddle.load(am_ckpt)["main_params"]) am.eval() diff --git a/paddlespeech/t2s/models/ernie_sat/__init__.py b/paddlespeech/t2s/models/ernie_sat/__init__.py index dc86fa5146429e8c08c8a6ca00759980a6e92b6e..7e795370e7b89f625150093dcaa245abe09c84e2 100644 --- 
diff --git a/paddlespeech/t2s/models/ernie_sat/__init__.py b/paddlespeech/t2s/models/ernie_sat/__init__.py
index dc86fa5146429e8c08c8a6ca00759980a6e92b6e..7e795370e7b89f625150093dcaa245abe09c84e2 100644
--- a/paddlespeech/t2s/models/ernie_sat/__init__.py
+++ b/paddlespeech/t2s/models/ernie_sat/__init__.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -11,4 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from .ernie_sat import *
+from .ernie_sat_updater import *
 from .mlm import *
diff --git a/paddlespeech/t2s/models/ernie_sat/ernie_sat.py b/paddlespeech/t2s/models/ernie_sat/ernie_sat.py
new file mode 100644
index 0000000000000000000000000000000000000000..54f5d542d035830f4f90c56daac659debe498d53
--- /dev/null
+++ b/paddlespeech/t2s/models/ernie_sat/ernie_sat.py
@@ -0,0 +1,716 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from typing import Dict
+from typing import List
+from typing import Optional
+
+import paddle
+from paddle import nn
+
+from paddlespeech.t2s.modules.activation import get_activation
+from paddlespeech.t2s.modules.conformer.convolution import ConvolutionModule
+from paddlespeech.t2s.modules.conformer.encoder_layer import EncoderLayer
+from paddlespeech.t2s.modules.layer_norm import LayerNorm
+from paddlespeech.t2s.modules.masked_fill import masked_fill
+from paddlespeech.t2s.modules.nets_utils import initialize
+from paddlespeech.t2s.modules.tacotron2.decoder import Postnet
+from paddlespeech.t2s.modules.transformer.attention import LegacyRelPositionMultiHeadedAttention
+from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention
+from paddlespeech.t2s.modules.transformer.attention import RelPositionMultiHeadedAttention
+from paddlespeech.t2s.modules.transformer.embedding import LegacyRelPositionalEncoding
+from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding
+from paddlespeech.t2s.modules.transformer.embedding import RelPositionalEncoding
+from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding
+from paddlespeech.t2s.modules.transformer.multi_layer_conv import Conv1dLinear
+from paddlespeech.t2s.modules.transformer.multi_layer_conv import MultiLayeredConv1d
+from paddlespeech.t2s.modules.transformer.positionwise_feed_forward import PositionwiseFeedForward
+from paddlespeech.t2s.modules.transformer.repeat import repeat
+from paddlespeech.t2s.modules.transformer.subsampling import Conv2dSubsampling
+
+
+# MLM -> Masked Language Model
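+# mySequential passes tuples through intact so that modules returning
+# (hidden, pos_emb) pairs can be chained; a plain nn.Sequential would only
+# forward a single tensor between layers.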
+class mySequential(nn.Sequential):
+    def forward(self, *inputs):
+        for module in self._sub_layers.values():
+            if isinstance(inputs, tuple):
+                inputs = module(*inputs)
+            else:
+                inputs = module(inputs)
+        return inputs
+
+
+class MaskInputLayer(nn.Layer):
+    def __init__(self, out_features: int) -> None:
+        super().__init__()
+        self.mask_feature = paddle.create_parameter(
+            shape=(1, 1, out_features),
+            dtype=paddle.float32,
+            default_initializer=paddle.nn.initializer.Assign(
+                paddle.normal(shape=(1, 1, out_features))))
+
+    def forward(self, input: paddle.Tensor,
+                masked_pos: paddle.Tensor=None) -> paddle.Tensor:
+        masked_pos = paddle.expand_as(paddle.unsqueeze(masked_pos, -1), input)
+        masked_input = masked_fill(input, masked_pos, 0) + masked_fill(
+            paddle.expand_as(self.mask_feature, input), ~masked_pos, 0)
+        return masked_input
+
+
+class MLMEncoder(nn.Layer):
+    """Conformer encoder module.
+
+    Args:
+        idim (int): Input dimension.
+        attention_dim (int): Dimension of attention.
+        attention_heads (int): The number of heads of multi head attention.
+        linear_units (int): The number of units of position-wise feed forward.
+        num_blocks (int): The number of encoder blocks.
+        dropout_rate (float): Dropout rate.
+        positional_dropout_rate (float): Dropout rate after adding positional encoding.
+        attention_dropout_rate (float): Dropout rate in attention.
+        input_layer (Union[str, paddle.nn.Layer]): Input layer type.
+        normalize_before (bool): Whether to use layer_norm before the first block.
+        concat_after (bool): Whether to concat attention layer's input and output.
+            if True, additional linear will be applied.
+            i.e. x -> x + linear(concat(x, att(x)))
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
+        positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear".
+        positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer.
+        macaron_style (bool): Whether to use macaron style for positionwise layer.
+        pos_enc_layer_type (str): Encoder positional encoding layer type.
+        selfattention_layer_type (str): Encoder attention layer type.
+        activation_type (str): Encoder activation function type.
+        use_cnn_module (bool): Whether to use convolution module.
+        zero_triu (bool): Whether to zero the upper triangular part of attention matrix.
+        cnn_module_kernel (int): Kernel size of convolution module.
+        padding_idx (int): Padding idx for input_layer=embed.
+        stochastic_depth_rate (float): Maximum probability to skip the encoder layer.
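+        vocab_size (int): Text vocabulary size, used when input_layer is "mlm" or "sega_mlm".
+        pre_speech_layer (int): Number of encoder blocks applied to speech alone before fusing with text.
+        text_masking (bool): Whether text tokens are masked in addition to speech frames.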
+
+    """
+
+    def __init__(self,
+                 idim: int,
+                 vocab_size: int=0,
+                 pre_speech_layer: int=0,
+                 attention_dim: int=256,
+                 attention_heads: int=4,
+                 linear_units: int=2048,
+                 num_blocks: int=6,
+                 dropout_rate: float=0.1,
+                 positional_dropout_rate: float=0.1,
+                 attention_dropout_rate: float=0.0,
+                 input_layer: str="conv2d",
+                 normalize_before: bool=True,
+                 concat_after: bool=False,
+                 positionwise_layer_type: str="linear",
+                 positionwise_conv_kernel_size: int=1,
+                 macaron_style: bool=False,
+                 pos_enc_layer_type: str="abs_pos",
+                 pos_enc_class=None,
+                 selfattention_layer_type: str="selfattn",
+                 activation_type: str="swish",
+                 use_cnn_module: bool=False,
+                 zero_triu: bool=False,
+                 cnn_module_kernel: int=31,
+                 padding_idx: int=-1,
+                 stochastic_depth_rate: float=0.0,
+                 text_masking: bool=False):
+        """Construct an Encoder object."""
+        super().__init__()
+        self._output_size = attention_dim
+        self.text_masking = text_masking
+        if self.text_masking:
+            self.text_masking_layer = MaskInputLayer(attention_dim)
+        activation = get_activation(activation_type)
+        if pos_enc_layer_type == "abs_pos":
+            pos_enc_class = PositionalEncoding
+        elif pos_enc_layer_type == "scaled_abs_pos":
+            pos_enc_class = ScaledPositionalEncoding
+        elif pos_enc_layer_type == "rel_pos":
+            assert selfattention_layer_type == "rel_selfattn"
+            pos_enc_class = RelPositionalEncoding
+        elif pos_enc_layer_type == "legacy_rel_pos":
+            pos_enc_class = LegacyRelPositionalEncoding
+            assert selfattention_layer_type == "legacy_rel_selfattn"
+        else:
+            raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type)
+
+        self.conv_subsampling_factor = 1
+        if input_layer == "linear":
+            self.embed = nn.Sequential(
+                nn.Linear(idim, attention_dim),
+                nn.LayerNorm(attention_dim),
+                nn.Dropout(dropout_rate),
+                nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate), )
+        elif input_layer == "conv2d":
+            self.embed = Conv2dSubsampling(
+                idim,
+                attention_dim,
+                dropout_rate,
+                pos_enc_class(attention_dim, positional_dropout_rate), )
+            self.conv_subsampling_factor = 4
+        elif input_layer == "embed":
+            self.embed = nn.Sequential(
+                nn.Embedding(idim, attention_dim, padding_idx=padding_idx),
+                pos_enc_class(attention_dim, positional_dropout_rate), )
+        elif input_layer == "mlm":
+            self.segment_emb = None
+            self.speech_embed = mySequential(
+                MaskInputLayer(idim),
+                nn.Linear(idim, attention_dim),
+                nn.LayerNorm(attention_dim),
+                nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate))
+            self.text_embed = nn.Sequential(
+                nn.Embedding(
+                    vocab_size, attention_dim, padding_idx=padding_idx),
+                pos_enc_class(attention_dim, positional_dropout_rate), )
+        elif input_layer == "sega_mlm":
+            self.segment_emb = nn.Embedding(
+                500, attention_dim, padding_idx=padding_idx)
+            self.speech_embed = mySequential(
+                MaskInputLayer(idim),
+                nn.Linear(idim, attention_dim),
+                nn.LayerNorm(attention_dim),
+                nn.ReLU(),
+                pos_enc_class(attention_dim, positional_dropout_rate))
+            self.text_embed = nn.Sequential(
+                nn.Embedding(
+                    vocab_size, attention_dim, padding_idx=padding_idx),
+                pos_enc_class(attention_dim, positional_dropout_rate), )
+        elif isinstance(input_layer, nn.Layer):
+            self.embed = nn.Sequential(
+                input_layer,
+                pos_enc_class(attention_dim, positional_dropout_rate), )
+        elif input_layer is None:
+            self.embed = nn.Sequential(
+                pos_enc_class(attention_dim, positional_dropout_rate))
+        else:
+            raise ValueError("unknown input_layer: " + input_layer)
+        self.normalize_before = normalize_before
+
+        # self-attention module definition
+        if selfattention_layer_type == "selfattn":
+            encoder_selfattn_layer = MultiHeadedAttention
+            encoder_selfattn_layer_args = (attention_heads, attention_dim,
+                                           attention_dropout_rate, )
+        elif selfattention_layer_type == "legacy_rel_selfattn":
+            assert pos_enc_layer_type == "legacy_rel_pos"
+            encoder_selfattn_layer = LegacyRelPositionMultiHeadedAttention
+            encoder_selfattn_layer_args = (attention_heads, attention_dim,
+                                           attention_dropout_rate, )
+        elif selfattention_layer_type == "rel_selfattn":
+            assert pos_enc_layer_type == "rel_pos"
+            encoder_selfattn_layer = RelPositionMultiHeadedAttention
+            encoder_selfattn_layer_args = (attention_heads, attention_dim,
+                                           attention_dropout_rate, zero_triu, )
+        else:
+            raise ValueError("unknown encoder_attn_layer: " +
+                             selfattention_layer_type)
+
+        # feed-forward module definition
+        if positionwise_layer_type == "linear":
+            positionwise_layer = PositionwiseFeedForward
+            positionwise_layer_args = (attention_dim, linear_units,
+                                       dropout_rate, activation, )
+        elif positionwise_layer_type == "conv1d":
+            positionwise_layer = MultiLayeredConv1d
+            positionwise_layer_args = (attention_dim, linear_units,
+                                       positionwise_conv_kernel_size,
+                                       dropout_rate, )
+        elif positionwise_layer_type == "conv1d-linear":
+            positionwise_layer = Conv1dLinear
+            positionwise_layer_args = (attention_dim, linear_units,
+                                       positionwise_conv_kernel_size,
+                                       dropout_rate, )
+        else:
+            raise NotImplementedError("Support only linear or conv1d.")
+
+        # convolution module definition
+        convolution_layer = ConvolutionModule
+        convolution_layer_args = (attention_dim, cnn_module_kernel, activation)
+
+        self.encoders = repeat(
+            num_blocks,
+            lambda lnum: EncoderLayer(
+                attention_dim,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                positionwise_layer(*positionwise_layer_args) if macaron_style else None,
+                convolution_layer(*convolution_layer_args) if use_cnn_module else None,
+                dropout_rate,
+                normalize_before,
+                concat_after,
+                stochastic_depth_rate * float(1 + lnum) / num_blocks, ), )
+        self.pre_speech_layer = pre_speech_layer
+        self.pre_speech_encoders = repeat(
+            self.pre_speech_layer,
+            lambda lnum: EncoderLayer(
+                attention_dim,
+                encoder_selfattn_layer(*encoder_selfattn_layer_args),
+                positionwise_layer(*positionwise_layer_args),
+                positionwise_layer(*positionwise_layer_args) if macaron_style else None,
+                convolution_layer(*convolution_layer_args) if use_cnn_module else None,
+                dropout_rate,
+                normalize_before,
+                concat_after,
+                stochastic_depth_rate * float(1 + lnum) / self.pre_speech_layer, ),
+        )
+        if self.normalize_before:
+            self.after_norm = LayerNorm(attention_dim)
+
+    def forward(self,
+                speech: paddle.Tensor,
+                text: paddle.Tensor,
+                masked_pos: paddle.Tensor,
+                speech_mask: paddle.Tensor=None,
+                text_mask: paddle.Tensor=None,
+                speech_seg_pos: paddle.Tensor=None,
+                text_seg_pos: paddle.Tensor=None):
+        """Encode input sequence.
+
+        """
+        if masked_pos is not None:
+            speech = self.speech_embed(speech, masked_pos)
+        else:
+            speech = self.speech_embed(speech)
+        if text is not None:
+            text = self.text_embed(text)
+        if speech_seg_pos is not None and text_seg_pos is not None and self.segment_emb:
+            speech_seg_emb = self.segment_emb(speech_seg_pos)
+            text_seg_emb = self.segment_emb(text_seg_pos)
+            text = (text[0] + text_seg_emb, text[1])
+            speech = (speech[0] + speech_seg_emb, speech[1])
+        if self.pre_speech_encoders:
+            speech, _ = self.pre_speech_encoders(speech, speech_mask)
+
+        if text is not None:
+            xs = paddle.concat([speech[0], text[0]], axis=1)
+            xs_pos_emb = paddle.concat([speech[1], text[1]], axis=1)
+            masks = paddle.concat([speech_mask, text_mask], axis=-1)
+        else:
+            xs = speech[0]
+            xs_pos_emb = speech[1]
+            masks = speech_mask
+
+        xs, masks = self.encoders((xs, xs_pos_emb), masks)
+
+        if isinstance(xs, tuple):
+            xs = xs[0]
+        if self.normalize_before:
+            xs = self.after_norm(xs)
+
+        return xs, masks
+
+
+class MLMDecoder(MLMEncoder):
+    def forward(self, xs: paddle.Tensor, masks: paddle.Tensor):
+        """Encode input sequence.
+
+        Args:
+            xs (paddle.Tensor): Input tensor (#batch, time, idim).
+            masks (paddle.Tensor): Mask tensor (#batch, time).
+
+        Returns:
+            paddle.Tensor: Output tensor (#batch, time, attention_dim).
+            paddle.Tensor: Mask tensor (#batch, time).
+
+        """
+        xs = self.embed(xs)
+        xs, masks = self.encoders(xs, masks)
+
+        if isinstance(xs, tuple):
+            xs = xs[0]
+        if self.normalize_before:
+            xs = self.after_norm(xs)
+
+        return xs, masks
+
+
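+# Note: MLMDecoder above is simply another conformer stack with the same
+# layer structure as the encoder, run on the fused speech/text hidden states.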
+# encoder and decoder are nn.Layer, not str
+class MLM(nn.Layer):
+    def __init__(self,
+                 odim: int,
+                 encoder: nn.Layer,
+                 decoder: Optional[nn.Layer],
+                 postnet_layers: int=0,
+                 postnet_chans: int=0,
+                 postnet_filts: int=0,
+                 text_masking: bool=False):
+
+        super().__init__()
+        self.odim = odim
+        self.encoder = encoder
+        self.decoder = decoder
+        self.vocab_size = encoder.text_embed[0]._num_embeddings
+
+        if self.decoder is None or not (hasattr(self.decoder,
+                                                'output_layer') and
+                                        self.decoder.output_layer is not None):
+            self.sfc = nn.Linear(self.encoder._output_size, odim)
+        else:
+            self.sfc = None
+        if text_masking:
+            self.text_sfc = nn.Linear(
+                self.encoder.text_embed[0]._embedding_dim,
+                self.vocab_size,
+                weight_attr=self.encoder.text_embed[0]._weight_attr)
+        else:
+            self.text_sfc = None
+
+        self.postnet = (None if postnet_layers == 0 else Postnet(
+            idim=self.encoder._output_size,
+            odim=odim,
+            n_layers=postnet_layers,
+            n_chans=postnet_chans,
+            n_filts=postnet_filts,
+            use_batch_norm=True,
+            dropout_rate=0.5, ))
+
+    def inference(
+            self,
+            speech: paddle.Tensor,
+            text: paddle.Tensor,
+            masked_pos: paddle.Tensor,
+            speech_mask: paddle.Tensor,
+            text_mask: paddle.Tensor,
+            speech_seg_pos: paddle.Tensor,
+            text_seg_pos: paddle.Tensor,
+            span_bdy: List[int],
+            use_teacher_forcing: bool=False, ) -> List[paddle.Tensor]:
+        '''
+        Args:
+            speech (paddle.Tensor): input speech (1, Tmax, D).
+            text (paddle.Tensor): input text (1, Tmax2).
+            masked_pos (paddle.Tensor): masked position of input speech (1, Tmax).
+            speech_mask (paddle.Tensor): mask of speech (1, 1, Tmax).
+            text_mask (paddle.Tensor): mask of text (1, 1, Tmax2).
+            speech_seg_pos (paddle.Tensor): phone index of each mel frame, 0<=n<=Tmax2 (1, Tmax).
+            text_seg_pos (paddle.Tensor): phone index of each text token, 0<=n<=Tmax2 (1, Tmax2).
+            span_bdy (List[int]): masked mel boundary of input speech (2,)
+            use_teacher_forcing (bool): whether to use teacher forcing
+        Returns:
+            List[Tensor]:
+                eg:
+                [Tensor(shape=[1, 181, 80]), Tensor(shape=[80, 80]), Tensor(shape=[1, 67, 80])]
+        '''
+
+        z_cache = None
+        if use_teacher_forcing:
+            before_outs, zs, *_ = self.forward(
+                speech=speech,
+                text=text,
+                masked_pos=masked_pos,
+                speech_mask=speech_mask,
+                text_mask=text_mask,
+                speech_seg_pos=speech_seg_pos,
+                text_seg_pos=text_seg_pos)
+            if zs is None:
+                zs = before_outs
+
+            speech = speech.squeeze(0)
+            outs = [speech[:span_bdy[0]]]
+            outs += [zs[0][span_bdy[0]:span_bdy[1]]]
+            outs += [speech[span_bdy[1]:]]
+            return outs
+        return None
+
+
+class MLMEncAsDecoder(MLM):
+    def forward(self,
+                speech: paddle.Tensor,
+                text: paddle.Tensor,
+                masked_pos: paddle.Tensor,
+                speech_mask: paddle.Tensor,
+                text_mask: paddle.Tensor,
+                speech_seg_pos: paddle.Tensor,
+                text_seg_pos: paddle.Tensor):
+        # feats: (Batch, Length, Dim)
+        # -> encoder_out: (Batch, Length2, Dim2)
+        encoder_out, h_masks = self.encoder(
+            speech=speech,
+            text=text,
+            masked_pos=masked_pos,
+            speech_mask=speech_mask,
+            text_mask=text_mask,
+            speech_seg_pos=speech_seg_pos,
+            text_seg_pos=text_seg_pos)
+        if self.decoder is not None:
+            zs, _ = self.decoder(encoder_out, h_masks)
+        else:
+            zs = encoder_out
+        speech_hidden_states = zs[:, :paddle.shape(speech)[1], :]
+        if self.sfc is not None:
+            before_outs = paddle.reshape(
+                self.sfc(speech_hidden_states),
+                (paddle.shape(speech_hidden_states)[0], -1, self.odim))
+        else:
+            before_outs = speech_hidden_states
+        if self.postnet is not None:
+            after_outs = before_outs + paddle.transpose(
+                self.postnet(paddle.transpose(before_outs, [0, 2, 1])),
+                [0, 2, 1])
+        else:
+            after_outs = None
+        return before_outs, after_outs, None
+
+
+class MLMDualMasking(MLM):
+    def forward(self,
+                speech: paddle.Tensor,
+                text: paddle.Tensor,
+                masked_pos: paddle.Tensor,
+                speech_mask: paddle.Tensor,
+                text_mask: paddle.Tensor,
+                speech_seg_pos: paddle.Tensor,
+                text_seg_pos: paddle.Tensor):
+        # feats: (Batch, Length, Dim)
+        # -> encoder_out: (Batch, Length2, Dim2)
+        encoder_out, h_masks = self.encoder(
+            speech=speech,
+            text=text,
+            masked_pos=masked_pos,
+            speech_mask=speech_mask,
+            text_mask=text_mask,
+            speech_seg_pos=speech_seg_pos,
+            text_seg_pos=text_seg_pos)
+        if self.decoder is not None:
+            zs, _ = self.decoder(encoder_out, h_masks)
+        else:
+            zs = encoder_out
+        speech_hidden_states = zs[:, :paddle.shape(speech)[1], :]
+        if self.text_sfc:
+            text_hidden_states = zs[:, paddle.shape(speech)[1]:, :]
+            text_outs = paddle.reshape(
+                self.text_sfc(text_hidden_states),
+                (paddle.shape(text_hidden_states)[0], -1, self.vocab_size))
+        if self.sfc is not None:
+            before_outs = paddle.reshape(
+                self.sfc(speech_hidden_states),
+                (paddle.shape(speech_hidden_states)[0], -1, self.odim))
+        else:
+            before_outs = speech_hidden_states
+        if self.postnet is not None:
+            after_outs = before_outs + paddle.transpose(
+                self.postnet(paddle.transpose(before_outs, [0, 2, 1])),
+                [0, 2, 1])
+        else:
+            after_outs = None
+        return before_outs, after_outs, text_outs
+
+
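+# ErnieSAT wraps one of the MLM variants: MLMDualMasking when text_masking is
+# True (both speech frames and text tokens are masked and predicted), and
+# MLMEncAsDecoder otherwise (only speech frames are masked).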
+class ErnieSAT(nn.Layer):
+    def __init__(
+            self,
+            # network structure related
+            idim: int,
+            odim: int,
+            postnet_layers: int=5,
+            postnet_filts: int=5,
+            postnet_chans: int=256,
+            use_scaled_pos_enc: bool=False,
+            encoder_type: str='conformer',
+            decoder_type: str='conformer',
+            enc_input_layer: str='sega_mlm',
+            enc_pre_speech_layer: int=0,
+            enc_cnn_module_kernel: int=7,
+            enc_attention_dim: int=384,
+            enc_attention_heads: int=2,
+            enc_linear_units: int=1536,
+            enc_num_blocks: int=4,
+            enc_dropout_rate: float=0.2,
+            enc_positional_dropout_rate: float=0.2,
+            enc_attention_dropout_rate: float=0.2,
+            enc_normalize_before: bool=True,
+            enc_macaron_style: bool=True,
+            enc_use_cnn_module: bool=True,
+            enc_selfattention_layer_type: str='legacy_rel_selfattn',
+            enc_activation_type: str='swish',
+            enc_pos_enc_layer_type: str='legacy_rel_pos',
+            enc_positionwise_layer_type: str='conv1d',
+            enc_positionwise_conv_kernel_size: int=3,
+            text_masking: bool=False,
+            dec_cnn_module_kernel: int=31,
+            dec_attention_dim: int=384,
+            dec_attention_heads: int=2,
+            dec_linear_units: int=1536,
+            dec_num_blocks: int=4,
+            dec_dropout_rate: float=0.2,
+            dec_positional_dropout_rate: float=0.2,
+            dec_attention_dropout_rate: float=0.2,
+            dec_macaron_style: bool=True,
+            dec_use_cnn_module: bool=True,
+            dec_selfattention_layer_type: str='legacy_rel_selfattn',
+            dec_activation_type: str='swish',
+            dec_pos_enc_layer_type: str='legacy_rel_pos',
+            dec_positionwise_layer_type: str='conv1d',
+            dec_positionwise_conv_kernel_size: int=3,
+            init_type: str="xavier_uniform", ):
+        super().__init__()
+        # store hyperparameters
+        self.odim = odim
+
+        self.use_scaled_pos_enc = use_scaled_pos_enc
+
+        # initialize parameters
+        initialize(self, init_type)
+
+        # Encoder
+        if encoder_type == "conformer":
+            encoder = MLMEncoder(
+                idim=odim,
+                vocab_size=idim,
+                pre_speech_layer=enc_pre_speech_layer,
+                attention_dim=enc_attention_dim,
+                attention_heads=enc_attention_heads,
+                linear_units=enc_linear_units,
+                num_blocks=enc_num_blocks,
+                dropout_rate=enc_dropout_rate,
+                positional_dropout_rate=enc_positional_dropout_rate,
+                attention_dropout_rate=enc_attention_dropout_rate,
+                input_layer=enc_input_layer,
+                normalize_before=enc_normalize_before,
+                positionwise_layer_type=enc_positionwise_layer_type,
+                positionwise_conv_kernel_size=enc_positionwise_conv_kernel_size,
+                macaron_style=enc_macaron_style,
+                pos_enc_layer_type=enc_pos_enc_layer_type,
+                selfattention_layer_type=enc_selfattention_layer_type,
+                activation_type=enc_activation_type,
+                use_cnn_module=enc_use_cnn_module,
+                cnn_module_kernel=enc_cnn_module_kernel,
+                text_masking=text_masking)
+        else:
+            raise ValueError(f"{encoder_type} is not supported.")
+
+        # Decoder
+        if decoder_type != 'no_decoder':
+            decoder = MLMDecoder(
+                idim=0,
+                input_layer=None,
+                cnn_module_kernel=dec_cnn_module_kernel,
+                attention_dim=dec_attention_dim,
+                attention_heads=dec_attention_heads,
+                linear_units=dec_linear_units,
+                num_blocks=dec_num_blocks,
+                dropout_rate=dec_dropout_rate,
+                positional_dropout_rate=dec_positional_dropout_rate,
+                macaron_style=dec_macaron_style,
+                use_cnn_module=dec_use_cnn_module,
+                selfattention_layer_type=dec_selfattention_layer_type,
+                activation_type=dec_activation_type,
+                pos_enc_layer_type=dec_pos_enc_layer_type,
+                positionwise_layer_type=dec_positionwise_layer_type,
+                positionwise_conv_kernel_size=dec_positionwise_conv_kernel_size)
+
+        else:
+            decoder = None
+
+        model_class = MLMDualMasking if text_masking else MLMEncAsDecoder
+
+        self.model = model_class(
+            odim=odim,
+            encoder=encoder,
+            decoder=decoder,
+            postnet_layers=postnet_layers,
+            postnet_filts=postnet_filts,
+            postnet_chans=postnet_chans,
+            text_masking=text_masking)
+
+        nn.initializer.set_global_initializer(None)
+
+    def forward(self,
+                speech: paddle.Tensor,
+                text: paddle.Tensor,
+                masked_pos: paddle.Tensor,
+                speech_mask: paddle.Tensor,
+                text_mask: paddle.Tensor,
+                speech_seg_pos: paddle.Tensor,
+                text_seg_pos: paddle.Tensor):
+        return self.model(
+            speech=speech,
+            text=text,
+            masked_pos=masked_pos,
+            speech_mask=speech_mask,
+            text_mask=text_mask,
+            speech_seg_pos=speech_seg_pos,
+            text_seg_pos=text_seg_pos)
+
+    def inference(
+            self,
+            speech: paddle.Tensor,
+            text: paddle.Tensor,
+            masked_pos: paddle.Tensor,
+            speech_mask: paddle.Tensor,
+            text_mask: paddle.Tensor,
+            speech_seg_pos: paddle.Tensor,
+            text_seg_pos: paddle.Tensor,
+            span_bdy: List[int],
+            use_teacher_forcing: bool=False, ) -> List[paddle.Tensor]:
+        return self.model.inference(
+            speech=speech,
+            text=text,
+            masked_pos=masked_pos,
+            speech_mask=speech_mask,
+            text_mask=text_mask,
+            speech_seg_pos=speech_seg_pos,
+            text_seg_pos=text_seg_pos,
+            span_bdy=span_bdy,
+            use_teacher_forcing=use_teacher_forcing)
+
+
+class ErnieSATInference(nn.Layer):
+    def __init__(self, normalizer, model):
+        super().__init__()
+        self.normalizer = normalizer
+        self.acoustic_model = model
+
+    def forward(
+            self,
+            speech: paddle.Tensor,
+            text: paddle.Tensor,
+            masked_pos: paddle.Tensor,
+            speech_mask: paddle.Tensor,
+            text_mask: paddle.Tensor,
+            speech_seg_pos: paddle.Tensor,
+            text_seg_pos: paddle.Tensor,
+            span_bdy: List[int],
+            use_teacher_forcing: bool=True, ):
+        outs = self.acoustic_model.inference(
+            speech=speech,
+            text=text,
+            masked_pos=masked_pos,
+            speech_mask=speech_mask,
+            text_mask=text_mask,
+            speech_seg_pos=speech_seg_pos,
+            text_seg_pos=text_seg_pos,
+            span_bdy=span_bdy,
+            use_teacher_forcing=use_teacher_forcing)
+
+        normed_mel_pre, normed_mel_masked, normed_mel_post = outs
+        logmel_pre = self.normalizer.inverse(normed_mel_pre)
+        logmel_masked = self.normalizer.inverse(normed_mel_masked)
+        logmel_post = self.normalizer.inverse(normed_mel_post)
+        return logmel_pre, logmel_masked, logmel_post
diff --git a/paddlespeech/t2s/models/ernie_sat/ernie_sat_updater.py b/paddlespeech/t2s/models/ernie_sat/ernie_sat_updater.py
new file mode 100644
index 0000000000000000000000000000000000000000..219341c8860f1b7e85ea1136523bb6d54e40f4ce
--- /dev/null
+++ b/paddlespeech/t2s/models/ernie_sat/ernie_sat_updater.py
@@ -0,0 +1,162 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
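+# ErnieSAT training pieces: ErnieSATUpdater runs one optimization step per
+# batch using MLMLoss; ErnieSATEvaluator computes the same losses on dev data.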
+import logging
+from pathlib import Path
+
+from paddle import distributed as dist
+from paddle.io import DataLoader
+from paddle.nn import Layer
+from paddle.optimizer import Optimizer
+from paddle.optimizer.lr import LRScheduler
+
+from paddlespeech.t2s.modules.losses import MLMLoss
+from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator
+from paddlespeech.t2s.training.reporter import report
+from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater
+logging.basicConfig(
+    format='%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] %(message)s',
+    datefmt='[%Y-%m-%d %H:%M:%S]')
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.INFO)
+
+
+class ErnieSATUpdater(StandardUpdater):
+    def __init__(self,
+                 model: Layer,
+                 optimizer: Optimizer,
+                 scheduler: LRScheduler,
+                 dataloader: DataLoader,
+                 init_state=None,
+                 text_masking: bool=False,
+                 odim: int=80,
+                 vocab_size: int=100,
+                 output_dir: Path=None):
+        super().__init__(model, optimizer, dataloader, init_state=init_state)
+        self.scheduler = scheduler
+
+        self.criterion = MLMLoss(
+            text_masking=text_masking, odim=odim, vocab_size=vocab_size)
+
+        log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
+        self.filehandler = logging.FileHandler(str(log_file))
+        logger.addHandler(self.filehandler)
+        self.logger = logger
+        self.msg = ""
+
+    def update_core(self, batch):
+        self.msg = "Rank: {}, ".format(dist.get_rank())
+        losses_dict = {}
+
+        before_outs, after_outs, text_outs = self.model(
+            speech=batch["speech"],
+            text=batch["text"],
+            masked_pos=batch["masked_pos"],
+            speech_mask=batch["speech_mask"],
+            text_mask=batch["text_mask"],
+            speech_seg_pos=batch["speech_seg_pos"],
+            text_seg_pos=batch["text_seg_pos"])
+
+        mlm_loss, text_mlm_loss = self.criterion(
+            speech=batch["speech"],
+            before_outs=before_outs,
+            after_outs=after_outs,
+            masked_pos=batch["masked_pos"],
+            text=batch["text"],
+            # maybe None
+            text_outs=text_outs,
+            # maybe None
+            text_masked_pos=batch["text_masked_pos"])
+
+        loss = (mlm_loss + text_mlm_loss) if text_mlm_loss is not None else mlm_loss
+
+        self.optimizer.clear_grad()
+
+        loss.backward()
+        self.optimizer.step()
+        self.scheduler.step()
+        scheduler_msg = 'lr: {}'.format(self.scheduler.last_lr)
+
+        report("train/loss", float(loss))
+        report("train/mlm_loss", float(mlm_loss))
+        if text_mlm_loss is not None:
+            report("train/text_mlm_loss", float(text_mlm_loss))
+            losses_dict["text_mlm_loss"] = float(text_mlm_loss)
+
+        losses_dict["mlm_loss"] = float(mlm_loss)
+        losses_dict["loss"] = float(loss)
+        self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
+                              for k, v in losses_dict.items())
+        self.msg += ', ' + scheduler_msg
+
+
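+# The evaluator mirrors update_core above, but only computes and reports the
+# losses; no backward pass or optimizer/scheduler step is taken.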
+class ErnieSATEvaluator(StandardEvaluator):
+    def __init__(self,
+                 model: Layer,
+                 dataloader: DataLoader,
+                 text_masking: bool=False,
+                 odim: int=80,
+                 vocab_size: int=100,
+                 output_dir: Path=None):
+        super().__init__(model, dataloader)
+
+        log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
+        self.filehandler = logging.FileHandler(str(log_file))
+        logger.addHandler(self.filehandler)
+        self.logger = logger
+        self.msg = ""
+
+        self.criterion = MLMLoss(
+            text_masking=text_masking, odim=odim, vocab_size=vocab_size)
+
+    def evaluate_core(self, batch):
+        self.msg = "Evaluate: "
+        losses_dict = {}
+
+        before_outs, after_outs, text_outs = self.model(
+            speech=batch["speech"],
+            text=batch["text"],
+            masked_pos=batch["masked_pos"],
+            speech_mask=batch["speech_mask"],
+            text_mask=batch["text_mask"],
+            speech_seg_pos=batch["speech_seg_pos"],
+            text_seg_pos=batch["text_seg_pos"])
+
+        mlm_loss, text_mlm_loss = self.criterion(
+            speech=batch["speech"],
+            before_outs=before_outs,
+            after_outs=after_outs,
+            masked_pos=batch["masked_pos"],
+            text=batch["text"],
+            # maybe None
+            text_outs=text_outs,
+            # maybe None
+            text_masked_pos=batch["text_masked_pos"])
+        loss = (mlm_loss + text_mlm_loss) if text_mlm_loss is not None else mlm_loss
+
+        report("eval/loss", float(loss))
+        report("eval/mlm_loss", float(mlm_loss))
+        if text_mlm_loss is not None:
+            report("eval/text_mlm_loss", float(text_mlm_loss))
+            losses_dict["text_mlm_loss"] = float(text_mlm_loss)
+
+        losses_dict["mlm_loss"] = float(mlm_loss)
+        losses_dict["loss"] = float(loss)
+
+        self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
+                              for k, v in losses_dict.items())
+        self.logger.info(self.msg)
diff --git a/paddlespeech/t2s/models/ernie_sat/mlm.py b/paddlespeech/t2s/models/ernie_sat/mlm.py
index c9c3d67a6a37bcd8121d1bc829fb88423fc5795e..647fdd9b403ce76d7755021a9823850e5b845e42 100644
--- a/paddlespeech/t2s/models/ernie_sat/mlm.py
+++ b/paddlespeech/t2s/models/ernie_sat/mlm.py
@@ -1,9 +1,20 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 import argparse
 from typing import Dict
 from typing import List
 from typing import Optional
-from typing import Tuple
-from typing import Union
 
 import paddle
 import yaml
@@ -109,7 +120,6 @@ class MLMEncoder(nn.Layer):
                  positionwise_conv_kernel_size: int=1,
                  macaron_style: bool=False,
                  pos_enc_layer_type: str="abs_pos",
-                 pos_enc_class=None,
                  selfattention_layer_type: str="selfattn",
                  activation_type: str="swish",
                  use_cnn_module: bool=False,
@@ -334,7 +344,6 @@ class MLMDecoder(MLMEncoder):
 
 # encoder and decoder is nn.Layer, not str
 class MLM(nn.Layer):
     def __init__(self,
-                 token_list: Union[Tuple[str, ...], List[str]],
                  odim: int,
                  encoder: nn.Layer,
                  decoder: Optional[nn.Layer],
@@ -345,7 +354,6 @@ class MLM(nn.Layer):
 
         super().__init__()
         self.odim = odim
-        self.token_list = token_list.copy()
         self.encoder = encoder
         self.decoder = decoder
         self.vocab_size = encoder.text_embed[0]._num_embeddings
@@ -535,32 +543,6 @@ def build_model(args: argparse.Namespace, model_class=MLMEncAsDecoder) -> MLM:
     vocab_size = len(token_list)
     odim = 80
 
-    pos_enc_class = ScaledPositionalEncoding if args.use_scaled_pos_enc else PositionalEncoding
-
-    if "conformer" == args.encoder:
-        conformer_self_attn_layer_type = args.encoder_conf[
-            'selfattention_layer_type']
-        conformer_pos_enc_layer_type = args.encoder_conf['pos_enc_layer_type']
-        conformer_rel_pos_type = "legacy"
-        if conformer_rel_pos_type == "legacy":
-            if conformer_pos_enc_layer_type == "rel_pos":
-                conformer_pos_enc_layer_type = "legacy_rel_pos"
-            if conformer_self_attn_layer_type == "rel_selfattn":
-                conformer_self_attn_layer_type = "legacy_rel_selfattn"
-        elif conformer_rel_pos_type == "latest":
-            assert conformer_pos_enc_layer_type != "legacy_rel_pos"
-            assert conformer_self_attn_layer_type != "legacy_rel_selfattn"
-        else:
-            raise ValueError(f"Unknown rel_pos_type: {conformer_rel_pos_type}")
-        args.encoder_conf[
-            'selfattention_layer_type'] = conformer_self_attn_layer_type
-        args.encoder_conf['pos_enc_layer_type'] = conformer_pos_enc_layer_type
-    if "conformer" == args.decoder:
-        args.decoder_conf[
-            'selfattention_layer_type'] = conformer_self_attn_layer_type
-        args.decoder_conf[
-            'pos_enc_layer_type'] = conformer_pos_enc_layer_type
-
     # Encoder
     encoder_class = MLMEncoder
 
@@ -571,10 +553,7 @@ def build_model(args: argparse.Namespace, model_class=MLMEncAsDecoder) -> MLM:
         args.encoder_conf['text_masking'] = False
 
     encoder = encoder_class(
-        args.input_size,
-        vocab_size=vocab_size,
-        pos_enc_class=pos_enc_class,
-        **args.encoder_conf)
+        args.input_size, vocab_size=vocab_size, **args.encoder_conf)
 
     # Decoder
     if args.decoder != 'no_decoder':
@@ -591,7 +570,6 @@ def build_model(args: argparse.Namespace, model_class=MLMEncAsDecoder) -> MLM:
         odim=odim,
         encoder=encoder,
         decoder=decoder,
-        token_list=token_list,
         **args.model_conf, )
 
     # Initialize
diff --git a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
index 347a10e90a2b80d09ddee07d14a19db12377c4be..9905765dbcd39de7854f0324ec0b3d932ad352b4 100644
--- a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
+++ b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
@@ -274,9 +274,7 @@ class FastSpeech2(nn.Layer):
         super().__init__()
 
         # store hyperparameters
-        self.idim = idim
         self.odim = odim
-        self.eos = idim - 1
         self.reduction_factor = reduction_factor
         self.encoder_type = encoder_type
         self.decoder_type = decoder_type
diff --git a/paddlespeech/t2s/modules/losses.py b/paddlespeech/t2s/modules/losses.py
index b2a31a32145afb981bff432579ddc513b937bd6f..1a43f5ef31fc7b8e555a666e0d59ca97dcac13ec 100644
--- a/paddlespeech/t2s/modules/losses.py
+++ b/paddlespeech/t2s/modules/losses.py
@@ -1068,6 +1068,8 @@ class KLDivergenceLoss(nn.Layer):
 # loss for ERNIE SAT
 class MLMLoss(nn.Layer):
     def __init__(self,
+                 odim: int,
+                 vocab_size: int=0,
                  lsm_weight: float=0.1,
                  ignore_id: int=-1,
                  text_masking: bool=False):
@@ -1079,15 +1081,19 @@ class MLMLoss(nn.Layer):
         else:
             self.l1_loss_func = nn.L1Loss(reduction='none')
         self.text_masking = text_masking
+        self.odim = odim
+        self.vocab_size = vocab_size
 
-    def forward(self,
-                speech: paddle.Tensor,
-                before_outs: paddle.Tensor,
-                after_outs: paddle.Tensor,
-                masked_pos: paddle.Tensor,
-                text: paddle.Tensor=None,
-                text_outs: paddle.Tensor=None,
-                text_masked_pos: paddle.Tensor=None):
+    def forward(
+            self,
+            speech: paddle.Tensor,
+            before_outs: paddle.Tensor,
+            after_outs: paddle.Tensor,
+            masked_pos: paddle.Tensor,
+            # for text_loss when text_masking == True
+            text: paddle.Tensor=None,
+            text_outs: paddle.Tensor=None,
+            text_masked_pos: paddle.Tensor=None):
         xs_pad = speech
         mlm_loss_pos = masked_pos > 0
@@ -1102,16 +1108,21 @@ class MLMLoss(nn.Layer):
             paddle.reshape(after_outs, (-1, self.odim)),
             paddle.reshape(xs_pad, (-1, self.odim))),
                           axis=-1)
-        loss_mlm = paddle.sum((loss * paddle.reshape(
+        mlm_loss = paddle.sum((loss * paddle.reshape(
             mlm_loss_pos, [-1]))) / paddle.sum((mlm_loss_pos) + 1e-10)
-        if self.text_masking:
-            loss_text = paddle.sum((self.text_mlm_loss(
-                paddle.reshape(text_outs, (-1, self.vocab_size)),
-                paddle.reshape(text, (-1))) * paddle.reshape(
-                    text_masked_pos,
-                    (-1)))) / paddle.sum((text_masked_pos) + 1e-10)
+        text_mlm_loss = None
 
-            return loss_mlm, loss_text
-
-        return loss_mlm
+        if self.text_masking:
+            assert text is not None
+            assert text_outs is not None
+            assert text_masked_pos is not None
+            text_outs = paddle.reshape(text_outs, [-1, self.vocab_size])
+            text = paddle.reshape(text, [-1])
+            text_mlm_loss = self.text_mlm_loss(text_outs, text)
+            text_masked_pos_reshape = paddle.reshape(text_masked_pos, [-1])
+            text_mlm_loss = paddle.sum(
+                text_mlm_loss *
+                text_masked_pos_reshape) / paddle.sum((text_masked_pos) + 1e-10)
+
+        return mlm_loss, text_mlm_loss
diff --git a/paddlespeech/t2s/modules/nets_utils.py b/paddlespeech/t2s/modules/nets_utils.py
index a3d5d1354f114952b784ba16b71c9344ef28c9d8..798e4dee88c7008f7c95fe3082890d357bcadd13 100644
--- a/paddlespeech/t2s/modules/nets_utils.py
+++ b/paddlespeech/t2s/modules/nets_utils.py
@@ -418,7 +418,6 @@ def phones_masking(xs_pad: paddle.Tensor,
             mean_phn_span=mean_phn_span).nonzero()
         masked_start = align_start[idx][masked_phn_idxs].tolist()
         masked_end = align_end[idx][masked_phn_idxs].tolist()
-
         for s, e in zip(masked_start, masked_end):
             masked_pos[idx, s:e] = 1
     non_eos_mask = paddle.reshape(src_mask, paddle.shape(xs_pad)[:2])
@@ -500,14 +499,15 @@ def phones_text_masking(xs_pad: paddle.Tensor,
                 set(range(length)) - set(masked_phn_idxs[0].tolist()))
             np.random.shuffle(unmasked_phn_idxs)
             masked_text_idxs = unmasked_phn_idxs[:text_mask_num_lower]
-            text_masked_pos[idx][masked_text_idxs] = 1
+            text_masked_pos[idx, masked_text_idxs] = 1
             masked_start = align_start[idx][masked_phn_idxs].tolist()
             masked_end = align_end[idx][masked_phn_idxs].tolist()
             for s, e in zip(masked_start, masked_end):
                 masked_pos[idx, s:e] = 1
-    non_eos_mask = paddle.reshape(src_mask, paddle.shape(xs_pad)[:2])
+    non_eos_mask = paddle.reshape(src_mask, shape=paddle.shape(xs_pad)[:2])
     masked_pos = masked_pos * non_eos_mask
-    non_eos_text_mask = paddle.reshape(text_mask, paddle.shape(xs_pad)[:2])
+    non_eos_text_mask = paddle.reshape(
+        text_mask, shape=paddle.shape(text_pad)[:2])
     text_masked_pos = text_masked_pos * non_eos_text_mask
     masked_pos = paddle.cast(masked_pos, 'bool')
     text_masked_pos = paddle.cast(text_masked_pos, 'bool')
diff --git a/tests/unit/cli/test_cli.sh b/tests/unit/cli/test_cli.sh
index 6879c4d6492557e5d9fb84a5ddbd1b7076ac340b..4d2ed1b85b41b7846ec98b25ae391f2aa1500849 100755
--- a/tests/unit/cli/test_cli.sh
+++ b/tests/unit/cli/test_cli.sh
@@ -43,7 +43,7 @@ paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
 paddlespeech tts --am speedyspeech_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
 paddlespeech tts --voc mb_melgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
 paddlespeech tts --voc style_melgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
-paddlespeech tts --voc hifigan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
+paddlespeech tts --voc pwgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
 paddlespeech tts --am fastspeech2_aishell3 --voc pwgan_aishell3 --input "你好,欢迎使用百度飞桨深度学习框架!" --spk_id 0
 paddlespeech tts --am fastspeech2_aishell3 --voc hifigan_aishell3 --input "你好,欢迎使用百度飞桨深度学习框架!" --spk_id 0
 paddlespeech tts --am fastspeech2_ljspeech --voc pwgan_ljspeech --lang en --input "Life was like a box of chocolates, you never know what you're gonna get."
@@ -53,6 +53,13 @@ paddlespeech tts --am fastspeech2_vctk --voc hifigan_vctk --input "Life was like a box of chocolates, you never know what you're gonna get." --spk_id 0
 paddlespeech tts --am tacotron2_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
 paddlespeech tts --am tacotron2_csmsc --voc wavernn_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
 paddlespeech tts --am tacotron2_ljspeech --voc pwgan_ljspeech --lang en --input "Life was like a box of chocolates, you never know what you're gonna get."
+# mix tts
+# The `am` must be `fastspeech2_mix`!
+# The `lang` must be `mix`!
+# The `voc` must be `hifigan_ljspeech` or `pwgan_ljspeech` for `fastspeech2_mix` for now!
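+# The inputs mix Chinese and English; --spk_id picks a voice of the multi-speaker `fastspeech2_mix` model.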
+paddlespeech tts --am fastspeech2_mix --voc hifigan_ljspeech --lang mix --input "热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!" --spk_id 0 --output mix_spk0.wav
+paddlespeech tts --am fastspeech2_mix --voc pwgan_ljspeech --lang mix --input "我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN." --spk_id 1 --output mix_spk1.wav
 
 # Speech Translation (only support linux)
 paddlespeech st --input ./en.wav
diff --git a/third_party/README.md b/third_party/README.md
index 843d0d3b2e69a573195c669ac0bbee8f52558960..98e03b0a346d7d05c9301c5160f7999e7d4574c6 100644
--- a/third_party/README.md
+++ b/third_party/README.md
@@ -1,27 +1,26 @@
-* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
+# python_kaldi_features
+
+[python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
 commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
 ref: https://zhuanlan.zhihu.com/p/55371926
 license: MIT
 
-* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
-commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
-license: MIT
+# Install ctc_decoders for Windows
 
-* [zhon](https://github.com/tsroten/zhon)
-commit: 09bf543696277f71de502506984661a60d24494c
-license: MIT
+`install_win_ctc.bat` is a batch script that installs paddlespeech_ctc_decoders on Windows.
 
-* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
-commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
-license: MIT
+## Prepare your environment
 
-* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
-commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
-license: MIT
+Make sure your environment meets the following requirements:
 
-* [phkit](https://github.com/KuangDD/phkit.git)
-commit: b2100293c1e36da531d7f30bd52c9b955a649522
-license: None
+* gcc: version >= 12.1.0
+* cmake: version >= 3.24.0
+* make: version >= 3.82.90
+* visual studio: version >= 2019
 
-* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git)
-license: MIT
+## Start your bat script
+
+```shell
+start install_win_ctc.bat
+
+```
diff --git a/third_party/ctc_decoders/scorer.cpp b/third_party/ctc_decoders/scorer.cpp
index 6c1d96be36c5332b0a6492589a44e63b9a1e91b2..6e7f68cf6baa89fa91b3055d04611dbd930ade53 100644
--- a/third_party/ctc_decoders/scorer.cpp
+++ b/third_party/ctc_decoders/scorer.cpp
@@ -13,7 +13,8 @@
 #include "decoder_utils.h"
 
 using namespace lm::ngram;
-
+// F_OK is not defined on Windows, so define it here.
+#define F_OK 0
 Scorer::Scorer(double alpha,
                double beta,
                const std::string& lm_path,
diff --git a/third_party/ctc_decoders/setup.py b/third_party/ctc_decoders/setup.py
index ce2787e3fa5531b19976c5cd7fa70c56e49caf71..9a8b292a07b9b5576bcff3a7b7a51c41d4a3a471 100644
--- a/third_party/ctc_decoders/setup.py
+++ b/third_party/ctc_decoders/setup.py
@@ -89,10 +89,11 @@ FILES = [
     or fn.endswith('unittest.cc'))
 ]
 # yapf: enable
-
 LIBS = ['stdc++']
 if platform.system() != 'Darwin':
     LIBS.append('rt')
+if platform.system() == 'Windows':
+    LIBS = ['-static-libstdc++']
 
 ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']
diff --git a/third_party/install_win_ctc.bat b/third_party/install_win_ctc.bat
new file mode 100644
index 0000000000000000000000000000000000000000..0bf1e7bb127246c101f43d00b257740b2fbef7c5
--- /dev/null
+++ b/third_party/install_win_ctc.bat
@@ -0,0 +1,23 @@
+@echo off
+
+cd ctc_decoders
+if not exist kenlm (
+    git clone https://github.com/Doubledongli/kenlm.git
+    @echo.
+)
+
+if not exist openfst-1.6.3 (
+    echo "Cloning openfst ..."
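+    rem This gitee mirror stands in for the official openfst release; it is
+    rem renamed to openfst-1.6.3 to match the path the build scripts expect.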
+    git clone https://gitee.com/koala999/openfst.git
+    ren openfst openfst-1.6.3
+    @echo.
+)
+
+if not exist ThreadPool (
+    git clone https://github.com/progschj/ThreadPool.git
+    @echo.
+)
+echo "Install decoders ..."
+python setup.py install --num_processes 4
\ No newline at end of file