Commit 812d80ab authored by: H Hui Zhang

Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into u2_export

@@ -20,7 +20,8 @@
</p>
<div align="center">
<h4>
<a href="#安装"> Installation </a>
| <a href="#快速开始"> Quick Start </a>
| <a href="#快速使用服务"> Quick Start Server </a>
| <a href="#快速使用流式服务"> Quick Start Streaming Server </a>
| <a href="#教程文档"> Documents </a>
@@ -38,6 +39,8 @@
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please see the [Arxiv](https://arxiv.org/abs/2205.12007) paper.

### Showcase

##### Speech Recognition
<div align = "center">
@@ -154,7 +157,7 @@
This project features an easy-to-use, efficient, flexible, and scalable implementation, aiming to better support industrial applications and academic research. It covers training, inference, testing, and deployment, mainly including:
- 📦 **Ease of Use**: low barrier to installation; the [CLI](#quick-start) lets you get started quickly.
- 🏆 **Align to the State-of-the-Art**: fast, lightweight models built on cutting-edge techniques.
- 🏆 **Streaming ASR and TTS System**: industrial-grade end-to-end streaming recognition and streaming synthesis.
- 💯 **Rule-based Chinese frontend**: our frontend includes text normalization and grapheme-to-phoneme conversion (G2P), plus custom linguistic rules to fit the Chinese context.
- **Varieties of Functions that Vitalize both Industry and Academia**:
  - 🛎️ Typical audio tasks: the toolkit provides implementations of tasks such as audio classification, speech translation, automatic speech recognition, text-to-speech, speaker verification, KWS, and more.
@@ -182,61 +185,195 @@
<img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" />
</div>
<a name="安装"></a>
## 安装 ## 安装
我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。 我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。
目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)
### Dependencies
+ gcc >= 4.8.5
+ paddlepaddle >= 2.3.1
+ python >= 3.7
+ linux (recommended), mac, windows

PaddleSpeech depends on paddlepaddle. For installation, see the [paddlepaddle website](https://www.paddlepaddle.org.cn/) and pick the build that matches your machine. A CPU-version example is given below; install other builds as your machine requires.
```shell
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
```
There are two quick ways to install PaddleSpeech: via pip, or from source (recommended).
### Install via pip
```shell
pip install pytest-runner
pip install paddlespeech
```
### Install from Source
```shell
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
```
For more installation issues, such as conda environments, the system libraries required by librosa, gcc problems, or kaldi installation, see the [installation documentation](docs/source/install_cn.md). If you run into problems, leave a message at [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) or search the existing issues.
<a name="快速开始"></a> <a name="快速开始"></a>
## 快速开始 ## 快速开始
安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试 安装完成后,开发者可以通过命令行或者Python快速开始,命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持16k wav格式音频
**声音分类** 你也可以在`aistudio`中快速体验 👉🏻[PaddleSpeech API Demo ](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)
测试音频示例下载
```shell
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
### Speech Recognition
<details><summary>&emsp;(Click to expand) Open-source Chinese speech recognition</summary>

One-line command-line demo:
```shell
paddlespeech asr --lang zh --input zh.wav
```
One-line Python API prediction:
```python
>>> from paddlespeech.cli.asr.infer import ASRExecutor
>>> asr = ASRExecutor()
>>> result = asr(audio_file="zh.wav")
>>> print(result)
我认为跑步最重要的就是给我带来了身体健康
```
</details>
### Text-to-Speech
<details><summary>&emsp;Open-source Chinese text-to-speech</summary>

Outputs WAV audio at a 24 kHz sample rate.

One-line command-line demo:
```shell
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
```

One-line Python API prediction:
```python
>>> from paddlespeech.cli.tts.infer import TTSExecutor
>>> tts = TTSExecutor()
>>> tts(text="今天天气十分不错。", output="output.wav")
```
- A web demo for text-to-speech is integrated into [Huggingface Spaces](https://huggingface.co/spaces). See: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
</details>
### Audio Classification
<details><summary>&emsp;An open-domain sound classification tool for a variety of scenarios</summary>

A sound classification model covering the 527 categories of the AudioSet dataset.

One-line command-line demo:
```shell
paddlespeech cls --input zh.wav
```
One-line Python API prediction:
```python
>>> from paddlespeech.cli.cls.infer import CLSExecutor
>>> cls = CLSExecutor()
>>> result = cls(audio_file="zh.wav")
>>> print(result)
Speech 0.9027186632156372
```
</details>
### Speaker Embedding Extraction
<details><summary>&emsp;An industrial-grade speaker embedding extraction tool</summary>

One-line command-line demo:
```shell
paddlespeech vector --task spk --input zh.wav
```
One-line Python API prediction:
```python
>>> from paddlespeech.cli.vector import VectorExecutor
>>> vec = VectorExecutor()
>>> result = vec(audio_file="zh.wav")
>>> print(result)  # 187-dimensional vector
[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729
4.9295826 1.4780062 0.3733844 10.695862 3.2697146
-4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...]
``` ```
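A common use for these embeddings, sketched below with plain NumPy rather than a PaddleSpeech API (and assuming the executor returns a 1-D vector as shown above), is scoring whether two utterances come from the same speaker via cosine similarity:

```python
import numpy as np
from paddlespeech.cli.vector import VectorExecutor

vec = VectorExecutor()
# Extract one embedding per utterance (zh.wav / en.wav are the test files downloaded above).
emb1 = np.asarray(vec(audio_file="zh.wav"))
emb2 = np.asarray(vec(audio_file="en.wav"))

# Cosine similarity: values close to 1.0 suggest the same speaker.
score = float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
print(score)
```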
</details>
### Punctuation Restoration
<details><summary>&emsp;One-click punctuation restoration for text, usable together with ASR models (a combined sketch follows at the end of this section)</summary>

One-line command-line demo:
```shell
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
One-line Python API prediction:
```python
>>> from paddlespeech.cli.text.infer import TextExecutor
>>> text_punc = TextExecutor()
>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
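The raw ASR transcript shown earlier comes back without punctuation, so a natural combination, sketched here using only the two executors already demonstrated in this README, is to feed the ASR output through the punctuation model:

```python
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.text.infer import TextExecutor

asr = ASRExecutor()
text_punc = TextExecutor()

# Transcribe a 16 kHz WAV file, then restore punctuation on the raw transcript.
transcript = asr(audio_file="zh.wav")
print(text_punc(text=transcript))
```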
</details>

### Speech Translation
<details><summary>&emsp;End-to-end English-to-Chinese speech translation tool</summary>

Uses prebuilt kaldi-related tools; currently only supported on Ubuntu.

One-line command-line demo:
```shell
paddlespeech st --input en.wav
```
One-line Python API prediction:
```python
>>> from paddlespeech.cli.st.infer import STExecutor
>>> st = STExecutor()
>>> result = st(audio_file="en.wav")
['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
</details>

More command-line usage can be found in the [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos).

> Note: For model training or fine-tuning, see [speech recognition](./docs/source/asr/quick_start.md) and [text-to-speech](./docs/source/tts/quick_start.md).
<a name="快速使用服务"></a> <a name="快速使用服务"></a>
## 快速使用服务 ## 快速使用服务
安装完成后,开发者可以通过命令行快速使用服务。 安装完成后,开发者可以通过命令行一键启动语音识别,语音合成,音频分类三种服务。
**启动服务** **启动服务**
```shell ```shell
@@ -614,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
The text-to-speech module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has since been merged into this repository. If you are interested in academic research on this task, see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). The [models introduction](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is also a good guide to the text-to-speech pipeline.
## ⭐ Use Cases
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): generates virtual-human voices with the PaddleSpeech text-to-speech module.**
......
@@ -5,14 +5,19 @@
## Introduction
This demo is an implementation of starting the voice service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of Python code.
For service interface definition, please check:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
## Usage
### 1. Installation
See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.

You can choose one way from easy, medium, and hard to install paddlespeech.

**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml files in the conf directory.**

### 2. Prepare Config File
The configuration file can be found in `conf/application.yaml`.
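With the config in place, the server can also be started from Python. A minimal sketch, assuming the `ServerExecutor` entry point in `paddlespeech.server.bin.paddlespeech_server` (check the installed package if the import path differs):

```python
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor

server_executor = ServerExecutor()
# Point at the prepared config; server logs go to the given file.
server_executor(
    config_file="./conf/application.yaml",
    log_file="./log/paddlespeech.log")
```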
@@ -47,7 +52,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `log_file`: log file. Default: ./log/paddlespeech.log

Output:
```text
[2022-02-23 11:17:32] [INFO] [server.py:64] Started server process [6384]
INFO: Waiting for application startup.
[2022-02-23 11:17:32] [INFO] [on.py:26] Waiting for application startup.
@@ -55,7 +60,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
[2022-02-23 11:17:32] [INFO] [on.py:38] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
[2022-02-23 11:17:32] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
```
- Python API
@@ -69,7 +73,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
INFO: Started server process [529]
[2022-02-23 14:57:56] [INFO] [server.py:64] Started server process [529]
INFO: Waiting for application startup.
@@ -78,7 +82,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
[2022-02-23 14:57:56] [INFO] [on.py:38] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
[2022-02-23 14:57:56] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
```
@@ -106,7 +109,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `audio_format`: Audio format. Default: "wav".

Output:
```text
[2022-02-23 18:11:22,819] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
[2022-02-23 18:11:22,820] [ INFO] - time cost 0.689145 s.
@@ -129,7 +132,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
```
@@ -158,12 +161,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `output`: Output wave filepath. Default: None, meaning the audio is not saved locally.

Output:
```text
[2022-02-23 15:20:37,875] [ INFO] - {'description': 'success.'}
[2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav.
[2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s.
[2022-02-23 15:20:37,875] [ INFO] - Response time: 0.348050 s.
```
- Python API
@@ -189,11 +191,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
```
### 6. CLS Client Usage
@@ -202,7 +203,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
If `127.0.0.1` is not accessible, you need to use the actual service IP address.
```bash
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
@@ -218,11 +219,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `topk`: topk scores of classification result.

Output:
```text
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
@@ -240,9 +239,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
@@ -274,7 +272,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```text
[2022-05-25 12:25:36,165] [ INFO] - vector http client start
[2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav
[2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector
@@ -299,7 +297,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```text
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 
12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}}
```
@@ -331,7 +329,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```text
[2022-05-25 12:33:24,527] [ INFO] - vector score http client start
[2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
@@ -358,7 +356,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Output:
```text
[2022-05-25 12:30:14,143] [ INFO] - vector score http client start
[2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
@@ -389,7 +387,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `input`(required): Input text to get punctuation.

Output:
```text
[2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。
[2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s.
```
@@ -408,11 +406,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
我认为跑步最重要的就是给我带来了身体健康。
```
## Models supported by the service
### ASR model
Get all models supported by the ASR service via `paddlespeech_server stats --task asr`; the static models can be used for Paddle Inference.
......
@@ -3,22 +3,29 @@
# Speech Server
## Introduction
This demo is an implementation of starting the offline speech service and accessing it. It can be done with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of Python code.

For the service interface definition, please refer to:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)

## Usage
### 1. Installation
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.

You can choose one of the easy, medium, and hard ways to install PaddleSpeech.

**If you install in easy mode, you need to prepare the yaml file yourself; refer to the yaml files in the conf directory.**

### 2. Prepare the Config File
The configuration file can be found in `conf/application.yaml`.
`engine_list` specifies the speech engines the launched service will include, in the format <speech task>_<engine type> (e.g. `asr_python`).
The speech tasks currently integrated into the service are: asr (speech recognition), tts (text-to-speech), cls (audio classification), vector (speaker verification), and text (text processing).
Two engine types are currently supported: python and inference (Paddle Inference).
**Note:** If the service starts normally inside a container but the client cannot reach its IP, try changing the `host` address in the config file to the local IP address.
@@ -48,7 +55,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `log_file`: log file. Default: ./log/paddlespeech.log

Output:
```text
[2022-02-23 11:17:32] [INFO] [server.py:64] Started server process [6384]
INFO: Waiting for application startup.
[2022-02-23 11:17:32] [INFO] [on.py:26] Waiting for application startup.
@@ -56,7 +63,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
[2022-02-23 11:17:32] [INFO] [on.py:38] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
[2022-02-23 11:17:32] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
```
- Python API
@@ -70,7 +76,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
INFO: Started server process [529]
[2022-02-23 14:57:56] [INFO] [server.py:64] Started server process [529]
INFO: Waiting for application startup.
@@ -79,7 +85,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
[2022-02-23 14:57:56] [INFO] [on.py:38] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
[2022-02-23 14:57:56] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit)
```
### 4. ASR Client Usage
@@ -108,8 +113,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `audio_format`: audio format. Default: wav.

Output:
```text
[2022-02-23 18:11:22,819] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
[2022-02-23 18:11:22,820] [ INFO] - time cost 0.689145 s.
@@ -131,9 +135,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}}
```
### 5. TTS Client Usage
@@ -162,7 +165,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `output`: path of the output audio. Default: None, meaning the audio is not saved locally.

Output:
```text
[2022-02-23 15:20:37,875] [ INFO] - {'description': 'success.'}
[2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav.
[2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s.
@@ -192,11 +195,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
{'description': 'success.'}
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
```
### 6. CLS Client Usage
@@ -207,7 +209,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
If `127.0.0.1` is not accessible, use the actual service IP address.
```bash
paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
@@ -223,11 +225,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `topk`: topk of the classification results.

Output:
```text
[2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
[2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s.
```
- Python API
@@ -242,13 +242,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
port=8090,
topk=1)
print(res.json())
```
Output:
```text
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}}
```
### 7. Speaker Verification Client Usage
@@ -259,7 +257,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
If `127.0.0.1` is not accessible, use the actual service IP address.
```bash
paddlespeech_client vector --task spk --server_ip 127.0.0.1 --port 8090 --input 85236145389.wav
```
@@ -275,9 +273,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
* task: the vector task, either spk or score. Default: spk.
* enroll: enrollment audio.
* test: test audio.

Output:
```text
[2022-05-25 12:25:36,165] [ INFO] - vector http client start
[2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav
[2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector
@@ -301,8 +299,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 
12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, 
-7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}}
```
@@ -332,8 +329,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
* test: test audio.

Output:
```text
[2022-05-25 12:33:24,527] [ INFO] - vector score http client start
[2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
@@ -344,7 +340,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
* Python API
```python
from paddlespeech.server.bin.paddlespeech_client import VectorClientExecutor
vectorclient_executor = VectorClientExecutor()
@@ -359,8 +355,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
Output:
```text
[2022-05-25 12:30:14,143] [ INFO] - vector score http client start
[2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav
[2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score
@@ -368,7 +363,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
{'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}}
```
### 8. Punctuation Prediction
**Note:** the first client request may take a little longer to respond.
@@ -391,7 +385,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `input` (required): the text for punctuation prediction.

Output:
```text
[2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。
[2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s.
```
@@ -406,11 +400,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
server_ip="127.0.0.1",
port=8090,)
print(res)
```
Output:
```text
我认为跑步最重要的就是给我带来了身体健康。
```
......
@@ -8,7 +8,7 @@ http://0.0.0.0:8010/docs
### 【POST】/asr/offline
Description: upload a 16k, 16-bit WAV file; returns the offline ASR model's recognition result.

Returns: JSON
@@ -26,11 +26,11 @@ http://0.0.0.0:8010/docs
### 【POST】/asr/offlinefile
Description: upload a 16k, 16-bit WAV file; returns the offline ASR recognition result plus the WAV data as base64.

Returns: JSON

Frontend API: audio file recognition (before playing the restored base64 data, remember to add a WAV header: sample rate 16k, int16; it can only be played after the header is added).
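As the note above says, the base64 payload is headerless PCM. A minimal Python sketch of a hypothetical helper (not part of the demo code) that decodes the payload and adds the WAV header before playback:

```python
import base64
import wave

def pcm_base64_to_wav(b64_pcm: str, path: str, sample_rate: int = 16000) -> None:
    """Wrap base64-encoded mono int16 PCM in a playable WAV container."""
    pcm = base64.b64decode(b64_pcm)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # int16 -> 2 bytes per sample
        f.setframerate(sample_rate)
        f.writeframes(pcm)
```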
Example:
@@ -48,7 +48,7 @@ http://0.0.0.0:8010/docs
### 【POST】/asr/collectEnv
Description: upload a 16k, int16 WAV file of sampled environment noise; generates the backend VAD energy threshold and returns it.

Frontend API: ASR environment sampling

@@ -64,9 +64,9 @@ http://0.0.0.0:8010/docs
### 【GET】/asr/stopRecord
Description: a GET request to /asr/stopRecord tells the backend to stop receiving the data uploaded to offlineStream over the WS protocol.

Frontend API: voice chat - pause recording (pause while fetching NLP results and playing TTS)

Returns: JSON
@@ -80,9 +80,9 @@ http://0.0.0.0:8010/docs
### 【GET】/asr/resumeRecord
Description: a GET request to /asr/resumeRecord tells the backend to resume receiving the data uploaded to offlineStream over the WS protocol.

Frontend API: voice chat - resume recording (when TTS playback finishes, tell the backend to resume recording)

Returns: JSON
@@ -100,16 +100,16 @@ http://0.0.0.0:8010/docs
Frontend API: voice chat - start recording; continuously streams microphone audio to the backend, and the backend pushes back recognition results

Returns: the offline model's recognition results, pushed by the backend over WS

### 【Websocket】/ws/asr/onlineStream
Description: continuously uploads frontend audio to the backend over the WS protocol; the frontend captures 16k, int16 PCM chunks and streams them to the backend

Frontend API: ASR streaming recognition - start recording; continuously streams microphone audio to the backend, and the backend pushes back recognition results

Returns: the online model's recognition results, pushed by the backend over WS
## NLP
@@ -202,7 +202,7 @@ http://0.0.0.0:8010/docs
### 【POST】/tts/offline
Description: get audio from the offline TTS model

Frontend API: TTS end-to-end synthesis
@@ -272,7 +272,7 @@ curl -X 'POST' \
### 【POST】/vpr/recog
Description: speaker recognition; recognizes an uploaded file by extracting its voiceprint for comparison; audio in 16k, int16 WAV format

Frontend API: speaker recognition - upload audio and return the recognition result
@@ -383,9 +383,9 @@ curl -X 'GET' \
### 【GET】/vpr/database64
Description: by vpr_id, fetches the audio used at voiceprint enrollment, converted to a 16k, int16 array and returned base64-encoded

Frontend API: speaker recognition - fetch the audio for a vpr entry (note: add a WAV header before playback, 16k, int16; see the WAV-header approach used for TTS playback, and mind the sample rate)

Example request:
@@ -402,5 +402,3 @@ curl -X 'GET' \
"result":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
"message": "ok"
```
\ No newline at end of file
# Paddle Speech Demo
PaddleSpeechDemo is a demo project built around PaddleSpeech's speech-interaction features; it is meant to help you get started with PaddleSpeech and build your own applications on top of it.

Speech interaction uses PaddleSpeech, dialogue and information extraction use PaddleNLP, and the web frontend is built with Vue3.

Main features:
+ Voice chat: PaddleSpeech's speech recognition and synthesis, with the dialogue powered by PaddleNLP's chitchat feature
+ Speaker recognition: a showcase of PaddleSpeech's speaker verification feature
+ Speech recognition: supports three modes: real-time recognition, end-to-end recognition, and audio file recognition
+ Text-to-speech: supports streaming synthesis and end-to-end synthesis
+ Voice commands: smart reimbursement of transportation expenses, built on PaddleSpeech's speech recognition and PaddleNLP's information extraction

What it looks like:
@@ -32,23 +32,21 @@ cd model
wget https://bj.bcebos.com/paddlenlp/applications/speech-cmd-analysis/finetune/model_state.pdparams
```
### Frontend Environment Setup
The frontend depends on `node.js`, which must be installed beforehand; make sure `npm` is available (tested with `npm` version `8.3.1`). The stable release of `node.js` from the [official site](https://nodejs.org/en/) is recommended.

```
# enter the frontend directory
cd web_client
# install yarn (skip if already installed)
npm install -g yarn
# install the frontend dependencies with yarn
yarn install
```
## Start the Services
### Start the Backend Service
@@ -66,18 +64,18 @@ cd web_client
yarn dev --port 8011
```

Under the default configuration, the backend address configured in the frontend is localhost, so the backend server and the browser opening the page must be on the same machine. If they are not, see the FAQ below: "How to change the backend if it is deployed on another machine or port".
## FAQ
#### Q: How do I install node.js?
A: See this [tutorial](https://www.runoob.com/nodejs/nodejs-install-setup.html) for installing node.js, and make sure npm is available.

#### Q: How do I change the backend if it is deployed on another machine or port?
A: The backend address is configured in two separate files.

Modify the first file, `PaddleSpeechWebClient/vite.config.js`:
```
server: {
@@ -92,7 +90,7 @@ server: {
}
```
Modify the second file, `PaddleSpeechWebClient/src/api/API.js` (the websocket proxy configuration does not take effect, so it has to be changed in this file):
```
// websocket (change this to the backend's endpoint)
@@ -107,9 +105,6 @@ A:这里主要是游览器安全策略的限制,需要配置游览器后重
Chrome settings address: chrome://flags/#unsafely-treat-insecure-origin-as-secure
## References
Vue audio-recording reference: https://blog.csdn.net/qq_41619796/article/details/107865602#t1
......
...@@ -7,13 +7,18 @@ This demo is an implementation of starting the streaming speech service and acce

Streaming ASR server only supports the `websocket` protocol, and doesn't support the `http` protocol.

For the service interface definition, please refer to:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
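As a rough sketch of what the protocol looks like on the wire, the snippet below streams a WAV file to the endpoint path that appears in the client logs later in this document, using the `websockets` package. The exact request framing (start/end signals, chunk size) is an assumption here; the authoritative protocol is the WebSocket API page above.

```python
import asyncio
import json
import wave

import websockets  # pip install websockets


async def stream_wav(path: str = "./zh.wav") -> None:
    # endpoint path as shown in the client logs below
    uri = "ws://127.0.0.1:8090/paddlespeech/asr/streaming"
    async with websockets.connect(uri) as ws:
        # the server announces readiness before accepting audio
        ready = json.loads(await ws.recv())
        assert ready.get("signal") == "server_ready"

        # send 16 kHz / 16-bit PCM in ~100 ms chunks (chunking is an assumption)
        with wave.open(path, "rb") as f:
            while chunk := f.readframes(1600):
                await ws.send(chunk)

        # read partial results until the server signals it has finished
        async for raw in ws:
            msg = json.loads(raw)
            print(msg.get("result", ""))
            if msg.get("signal") == "finished":
                break


asyncio.run(stream_wav())
```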
## Usage

### 1. Installation

See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

It is recommended to use **paddlepaddle 2.3.1** or above.

You can choose one way from easy, medium and hard to install paddlespeech.

**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**

### 2. Prepare config File

The configuration file can be found in `conf/ws_application.yaml` or `conf/ws_conformer_wenetspeech_application.yaml`.

...@@ -48,7 +53,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- `log_file`: log file. Default: `./log/paddlespeech.log`

Output:
```text
[2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance
[2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu
[2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k
...@@ -85,7 +90,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```

Output:
```text
[2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance
[2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu
[2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k
...@@ -117,7 +122,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav

If `127.0.0.1` is not accessible, you need to use the actual service IP address.

```bash
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
...@@ -126,6 +131,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```bash
paddlespeech_client asr_online --help
```
Arguments:
- `server_ip`: server ip. Default: 127.0.0.1
- `port`: server port. Default: 8090
...@@ -137,7 +143,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- `punc.server_port`: punctuation server port. Default: None.

Output:
```text
[2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client
[2022-05-06 21:10:35,600] [ INFO] - asr websocket client start
[2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming
...@@ -205,7 +211,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
[2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]}
[2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957
[2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康
```
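The `RTF` reported in the last lines above is simply the elapsed processing time divided by the duration of the input audio; a value above 1 means recognition ran slower than real time. A quick check with the numbers from the log:

```python
# RTF (real-time factor) = processing time / audio duration.
# The numbers below are copied from the final log lines above.
audio_duration = 4.9968125    # seconds of input audio
elapsed = 9.225094079971313   # seconds spent on recognition
rtf = elapsed / audio_duration
print(f"RTF = {rtf:.3f}")     # 1.846, matching the log (slower than real time)
```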
- Python API
...@@ -224,7 +229,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```
Output:
```text
[2022-05-06 21:14:03,137] [ INFO] - asr websocket client start
[2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming
[2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"}
...@@ -299,12 +304,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- Command Line
**Note:** The server is deployed on the `CPU` device by default; it can be deployed on the `GPU` by modifying the `device` parameter in the service configuration file.
```bash
# launch the punctuation service from the PaddleSpeech/demos/streaming_asr_server directory
paddlespeech_server start --config_file conf/punc_application.yaml
```

Usage:
```bash
paddlespeech_server start --help
```
...@@ -316,7 +320,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
Output:
```text
[2022-05-02 17:59:26,285] [ INFO] - Create the TextEngine Instance
[2022-05-02 17:59:26,285] [ INFO] - Init the text engine
[2022-05-02 17:59:26,285] [ INFO] - Text Engine set the device: gpu:0
...@@ -349,7 +353,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```

Output:
```text
[2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance
[2022-05-02 18:09:02,543] [ INFO] - Init the text engine
[2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0
...@@ -376,17 +380,17 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
If `127.0.0.1` is not accessible, you need to use the actual service IP address.

```bash
paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康"
```

Output:
```text
[2022-05-02 18:12:29,767] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。
[2022-05-02 18:12:29,767] [ INFO] - Response time 0.096548 s.
```

- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TextClientExecutor
...@@ -400,11 +404,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```

Output:
```text
我认为跑步最重要的就是给我带来了身体健康。
```
## Join streaming ASR and punctuation server

By default, each server is deployed on the `CPU` device; speech recognition and punctuation prediction can be deployed on different `GPU`s by modifying the `device` parameter in each service configuration file.

...@@ -413,7 +416,7 @@ We use `streaming_asr_server.py` and `punc_server.py` two services to launch st

### 1. Start two servers

```bash
# Note: streaming speech recognition and punctuation prediction are configured on different graphics cards through the configuration files
bash server.sh
```
...@@ -423,11 +426,11 @@ bash server.sh

If `127.0.0.1` is not accessible, you need to use the actual service IP address.

```bash
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
```
Output:
```text
[2022-05-07 11:21:47,060] [ INFO] - asr websocket client start
[2022-05-07 11:21:47,060] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
[2022-05-07 11:21:47,080] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"}
...@@ -501,11 +504,11 @@ bash server.sh

If `127.0.0.1` is not accessible, you need to use the actual service IP address.

```bash
python3 websocket_client.py --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --wavfile ./zh.wav
```

Output:
```text
[2022-05-07 11:11:02,984] [ INFO] - Start to do streaming asr client
[2022-05-07 11:11:02,985] [ INFO] - asr websocket client start
[2022-05-07 11:11:02,985] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
...@@ -574,5 +577,3 @@ bash server.sh
[2022-05-07 11:11:18,915] [ INFO] - audio duration: 4.9968125, elapsed time: 15.928460597991943, RTF=3.187724293835709
[2022-05-07 11:11:18,916] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康
```
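The `--punc.*` options above also have Python-side counterparts. A minimal sketch, assuming `ASROnlineClientExecutor` accepts `punc_server_ip` / `punc_server_port` keyword arguments mirroring the CLI flags (check the client's signature in your installed version):

```python
from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor

asrclient_executor = ASROnlineClientExecutor()
res = asrclient_executor(
    input="./zh.wav",
    server_ip="127.0.0.1",
    port=8290,
    sample_rate=16000,
    lang="zh_cn",
    audio_format="wav",
    # assumed keyword mirrors of --punc.server_ip / --punc.port
    punc_server_ip="127.0.0.1",
    punc_server_port=8190)
print(res)
```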
...@@ -3,16 +3,21 @@

# Streaming ASR Service

## Introduction

This demo is an implementation of starting a streaming speech service and accessing it. It can be done with a single `paddlespeech_server` / `paddlespeech_client` command, or with a few lines of Python code.

**The streaming ASR service only supports the `websocket` protocol; the `http` protocol is not supported.**

For the service interface definition, please refer to:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
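The messages delivered over this websocket interface are JSON; their shapes can be read off the sample logs later in this document (`server_ready`, partial results, and a final message whose `times` field carries per-word timestamps). A minimal sketch of handling them (field names beyond those visible in the logs are assumptions):

```python
import json

def handle_message(raw: str) -> None:
    """Dispatch one websocket text frame from the streaming ASR server."""
    msg = json.loads(raw)
    if msg.get("signal") == "server_ready":
        print("server ready, start sending audio chunks")
    elif msg.get("signal") == "finished":
        # the final message carries per-word begin/end times in seconds
        for w in msg.get("times", []):
            print(f"{w['w']}: {w['bg']:.2f}s -> {w['ed']:.2f}s")
        print("final result:", msg.get("result", ""))
    else:
        print("partial result:", msg.get("result", ""))
```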
## Usage

### 1. Installation

For the detailed installation of PaddleSpeech, see the [installation docs](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

**paddlepaddle 2.3.1** or above is recommended.

You can choose one of the easy, medium, or hard ways to install PaddleSpeech.

**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare the config file

...@@ -26,7 +31,6 @@
* conformer: `conf/ws_conformer_wenetspeech_application.yaml`

The input of this ASR client should be a WAV file (`.wav`) whose sample rate matches the model's.

Sample audio for this ASR client can be downloaded:

...@@ -54,7 +58,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- `log_file`: log file. Default: `./log/paddlespeech.log`

Output:
```text
[2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance
[2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu
[2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k
...@@ -90,8 +94,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
log_file="./log/paddlespeech.log")
```

Output:
```text
[2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance
[2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu
[2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k
...@@ -122,7 +126,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
`127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址
``` ```bash
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
``` ```
...@@ -143,8 +147,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ...@@ -143,8 +147,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- `punc.server_port` 标点预测服务的端口port。默认是None。 - `punc.server_port` 标点预测服务的端口port。默认是None。
输出: 输出:
```text
```bash
[2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client
[2022-05-06 21:10:35,600] [ INFO] - asr websocket client start [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start
[2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming
...@@ -230,7 +233,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ...@@ -230,7 +233,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
``` ```
输出: 输出:
```bash ```text
[2022-05-06 21:14:03,137] [ INFO] - asr websocket client start [2022-05-06 21:14:03,137] [ INFO] - asr websocket client start
[2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming [2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming
[2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} [2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"}
...@@ -297,34 +300,29 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ...@@ -297,34 +300,29 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
[2022-05-06 21:14:12,159] [ INFO] - audio duration: 4.9968125, elapsed time: 9.019973039627075, RTF=1.8051453881103354 [2022-05-06 21:14:12,159] [ INFO] - audio duration: 4.9968125, elapsed time: 9.019973039627075, RTF=1.8051453881103354
[2022-05-06 21:14:12,160] [ INFO] - asr websocket client finished [2022-05-06 21:14:12,160] [ INFO] - asr websocket client finished
``` ```
## Punctuation Prediction

### 1. Server usage

- Command line
**Note:** Deployed on the `cpu` device by default; it can be deployed on `gpu` by modifying the `device` parameter in the service config file.
```bash
# start the punctuation prediction service from the PaddleSpeech/demos/streaming_asr_server directory
paddlespeech_server start --config_file conf/punc_application.yaml
```

Usage:
```bash
paddlespeech_server start --help
```
Arguments:
- `config_file`: configuration file of the service.
- `log_file`: log file.

Output:
```text
[2022-05-02 17:59:26,285] [ INFO] - Create the TextEngine Instance
[2022-05-02 17:59:26,285] [ INFO] - Init the text engine
[2022-05-02 17:59:26,285] [ INFO] - Text Engine set the device: gpu:0
...@@ -356,8 +354,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
log_file="./log/paddlespeech.log")
```

Output:
```text
[2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance
[2022-05-02 18:09:02,543] [ INFO] - Init the text engine
[2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0
...@@ -384,17 +382,17 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
If `127.0.0.1` is not accessible, use the actual service IP address.

```bash
paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康"
```

Output:
```text
[2022-05-02 18:12:29,767] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。
[2022-05-02 18:12:29,767] [ INFO] - Response time 0.096548 s.
```

- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TextClientExecutor
...@@ -407,12 +405,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
print(res)
```

Output:
```text
我认为跑步最重要的就是给我带来了身体健康。
```
## Joint Streaming ASR and Punctuation Prediction

**Note:** Each service is deployed on the `cpu` device by default; ASR and punctuation prediction can be deployed on different `gpu`s by modifying the `device` parameter in each service config file.

...@@ -420,7 +417,7 @@

### 1. Start the services

```bash
# Note: streaming ASR and punctuation prediction are configured onto different graphics cards via their config files
bash server.sh
```
...@@ -430,11 +427,11 @@ bash server.sh

If `127.0.0.1` is not accessible, use the actual service IP address.

```bash
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
```
Output:
```text
[2022-05-07 11:21:47,060] [ INFO] - asr websocket client start
[2022-05-07 11:21:47,060] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
[2022-05-07 11:21:47,080] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"}
...@@ -508,11 +505,11 @@ bash server.sh

If `127.0.0.1` is not accessible, use the actual service IP address.

```bash
python3 websocket_client.py --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --wavfile ./zh.wav
```

Output:
```text
[2022-05-07 11:11:02,984] [ INFO] - Start to do streaming asr client
[2022-05-07 11:11:02,985] [ INFO] - asr websocket client start
[2022-05-07 11:11:02,985] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming
...@@ -581,5 +578,3 @@ bash server.sh
[2022-05-07 11:11:18,915] [ INFO] - audio duration: 4.9968125, elapsed time: 15.928460597991943, RTF=3.187724293835709
[2022-05-07 11:11:18,916] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康
```
...@@ -5,15 +5,19 @@

## Introduction

This demo is an implementation of starting the streaming speech synthesis service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client`, or a few lines of code in Python.
For service interface definition, please check:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## Usage

### 1. Installation

See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

It is recommended to use **paddlepaddle 2.3.1** or above.

You can choose one way from easy, medium and hard to install paddlespeech.

**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**

### 2. Prepare config File

The configuration file can be found in `conf/tts_online_application.yaml`.
...@@ -29,11 +33,10 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- Both hifigan and mb_melgan support streaming voc inference.
- When the voc model is mb_melgan and voc_pad=14, the streaming synthetic audio is identical to the non-streaming synthetic audio; voc_pad can be reduced to a minimum of 7 with no audible artifacts, while a voc_pad below 7 makes the synthetic audio sound abnormal (a sketch of the underlying chunk/pad arithmetic follows this list).
- When the voc model is hifigan and voc_pad=19, the streaming synthetic audio is identical to the non-streaming synthetic audio; with voc_pad=14 there are no audible artifacts.
- Pad calculation method of streaming vocoder in PaddleSpeech: [AIStudio tutorial](https://aistudio.baidu.com/aistudio/projectdetail/4151335)
- Inference speed: mb_melgan > hifigan; audio quality: mb_melgan < hifigan
- **Note:** If the service can be started normally in the container, but the client access IP is unreachable, you can try to replace the `host` address in the configuration file with the local IP address.
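As referenced in the list above, a minimal sketch of the block/pad windowing these settings imply: each streaming step feeds `pad + block + pad` frames to the vocoder and keeps only the centre `block` frames, so a larger pad buys correctness at the cost of extra computation. The windowing below illustrates the idea rather than the exact indexing inside PaddleSpeech; voc_block=36 is an illustrative value, voc_pad=14 is the mb_melgan setting discussed above.

```python
# Sketch of streaming-chunk windowing with a block/pad scheme.
def chunk_windows(n_frames: int, block: int, pad: int):
    """Yield ((feed_start, feed_end), (keep_start, keep_end)) per chunk."""
    for keep_start in range(0, n_frames, block):
        keep_end = min(keep_start + block, n_frames)
        feed_start = max(0, keep_start - pad)      # left context
        feed_end = min(n_frames, keep_end + pad)   # right context
        yield (feed_start, feed_end), (keep_start, keep_end)

for feed, keep in chunk_windows(n_frames=100, block=36, pad=14):
    print(f"feed mel[{feed[0]}:{feed[1]}] -> keep output frames {keep}")
```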
### 3. Streaming speech synthesis server and client using http protocol

#### 3.1 Server Usage
- Command Line (Recommended)
...@@ -53,7 +56,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- `log_file`: log file. Default: ./log/paddlespeech.log

Output:
```text
[2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s
[2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s
[2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s
...@@ -80,7 +83,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
```
Output:
```text
[2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s
[2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
[2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s
...@@ -93,8 +96,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```

#### 3.2 Streaming TTS client Usage
...@@ -125,7 +126,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- Currently, only the single-speaker model is supported in the code, so `spk_id` does not take effect. Streaming TTS does not support changing sample rate, variable speed and volume.
Output:
```text
[2022-04-24 21:08:18,559] [ INFO] - tts http client start
[2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s
...@@ -154,7 +155,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
```

Output:
```text
[2022-04-24 21:11:13,798] [ INFO] - tts http client start
[2022-04-24 21:11:16,800] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:11:16,801] [ INFO] - 首包响应:0.18234872817993164 s
...@@ -164,7 +165,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav
```
### 4. Streaming speech synthesis server and client using websocket protocol

#### 4.1 Server Usage
- Command Line (Recommended)
...@@ -184,7 +184,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
- `log_file`: log file. Default: ./log/paddlespeech.log

Output:
```text
[2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
...@@ -197,8 +197,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
- Python API
...@@ -212,7 +210,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
```
Output:
```text
[2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
...@@ -225,7 +223,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 4.2 Streaming TTS client Usage
...@@ -258,7 +255,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`.

Output:
```text
[2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s
...@@ -266,7 +263,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav
```
- Python API
...@@ -283,11 +279,10 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
    spk_id=0,
    output="./output.wav",
    play=False)
```
Output:
```text
[2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
...@@ -295,9 +290,4 @@ The configuration file can be found in `conf/tts_online_application.yaml`.
[2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
```
...@@ -3,15 +3,19 @@

# Streaming TTS Service

## Introduction

This demo is an implementation of starting a streaming speech synthesis service and accessing it. It can be done with a single `paddlespeech_server` / `paddlespeech_client` command, or with a few lines of Python code.

For the service interface definition, please refer to:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## Usage

### 1. Installation

See the [installation docs](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).

**paddlepaddle 2.3.1** or above is recommended.

You can choose one of the easy, medium, or hard ways to install PaddleSpeech.

**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare the config file

...@@ -20,19 +24,20 @@
- `engine_list` specifies the speech engines the started service will include, in the format <speech task>_<engine type>.
- This demo mainly introduces the streaming TTS service, so the speech task should be set to tts.
- Two engine types are currently supported: **online** uses Python dynamic-graph inference; **online-onnx** uses onnxruntime for inference, which is faster.
- The streaming TTS engine supports **fastspeech2 and fastspeech2_cnndecoder** as AM models, and **hifigan and mb_melgan** as voc models.
- In streaming AM inference, one chunk of data is inferred at a time to achieve the streaming effect. `am_block` is the number of valid frames in a chunk, and `am_pad` is the number of frames added before and after am_block in a chunk. am_pad exists to eliminate the error introduced by streaming inference, so that streaming does not degrade synthesis quality.
- fastspeech2 does not support streaming AM inference, so am_pad and am_block have no effect on it.
- fastspeech2_cnndecoder supports streaming inference; with am_pad=12, the streamed audio is identical to the non-streaming audio.
- In streaming voc inference, one chunk of data is inferred at a time to achieve the streaming effect. `voc_block` is the number of valid frames in a chunk, and `voc_pad` is the number of frames added before and after voc_block in a chunk. voc_pad exists to eliminate the error introduced by streaming inference, so that streaming does not degrade synthesis quality (a sketch relating these frame counts to latency follows this list).
- Both hifigan and mb_melgan support streaming voc inference.
- When the voc model is mb_melgan and voc_pad=14, the streamed audio is identical to the non-streaming audio; voc_pad can be as small as 7 with no audible artifacts, while a voc_pad below 7 produces audible artifacts.
- When the voc model is hifigan and voc_pad=19, the streamed audio is identical to the non-streaming audio; with voc_pad=14 there are no audible artifacts.
- Pad calculation method for PaddleSpeech's streaming vocoders: [AIStudio tutorial](https://aistudio.baidu.com/aistudio/projectdetail/4151335)
- Inference speed: mb_melgan > hifigan; audio quality: mb_melgan < hifigan
- **Note:** If the service starts normally inside a container but the client cannot reach its IP, try replacing the `host` address in the config file with the local IP address.
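To relate the frame counts above to perceived latency (as referenced in the list): before the first audio packet can be emitted, the vocoder must process roughly `voc_block + voc_pad` mel frames, and each mel frame spans `n_shift / fs` seconds. A back-of-the-envelope sketch; the hop size and block size here are assumed illustrative values, not read from this demo's config:

```python
# Rough estimate of how much audio the first streamed chunk covers.
fs = 24000        # output sample rate in Hz (assumed)
n_shift = 300     # hop size in samples, i.e. 12.5 ms per mel frame (assumed)
voc_block = 36    # valid frames per chunk (assumed)
voc_pad = 14      # context frames, the mb_melgan setting discussed above

frames_first_chunk = voc_block + voc_pad   # right pad delays the first chunk
seconds_per_frame = n_shift / fs
print(f"first chunk covers ~{frames_first_chunk * seconds_per_frame * 1000:.0f} ms of audio")
```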
### 3. Streaming TTS server and client using the http protocol

#### 3.1 Server usage
- Command line (recommended)
...@@ -51,7 +56,7 @@
- `log_file`: log file. Default: ./log/paddlespeech.log

Output:
```text
[2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s
[2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s
[2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s
...@@ -64,7 +69,6 @@
[2022-04-24 20:05:28] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 20:05:28] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
- Python API
...@@ -77,8 +81,8 @@
    log_file="./log/paddlespeech.log")
```

Output:
```text
[2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s
[2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
[2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s
...@@ -91,8 +95,6 @@
[2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 3.2 Client usage
...@@ -124,7 +126,7 @@

Output:
```text
[2022-04-24 21:08:18,559] [ INFO] - tts http client start
[2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s
...@@ -163,8 +165,7 @@
[2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav
```
### 4. Streaming TTS server and client using the websocket protocol

#### 4.1 Server usage
- Command line (recommended)
First modify the config file `conf/tts_online_application.yaml`, **setting `protocol` to `websocket`**.
...@@ -183,7 +184,7 @@
- `log_file`: log file. Default: ./log/paddlespeech.log

Output:
```text
[2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
...@@ -196,8 +197,6 @@
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
- Python API
...@@ -210,8 +209,8 @@
    log_file="./log/paddlespeech.log")
```

Output:
```text
[2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
...@@ -224,13 +223,12 @@
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit)
```
#### 4.2 Client usage
- Command line (recommended)

Access the websocket streaming TTS service:

If `127.0.0.1` is not accessible, use the actual service IP address.

...@@ -255,9 +253,8 @@
- Currently only single-speaker models are supported in the code, so the choice of spk_id has no effect. Streaming TTS does not support changing the sample rate, speech rate, or volume.

Output:
```text
[2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s
...@@ -265,7 +262,6 @@
[2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav
```
- Python API
...@@ -282,11 +278,10 @@
    spk_id=0,
    output="./output.wav",
    play=False)
```

Output:
```text
[2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s
...@@ -294,8 +289,4 @@
[2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s
[2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
```
FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
LABEL maintainer="paddlesl@baidu.com"

RUN apt-get update \
    && apt-get install -y libsndfile-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
RUN pip3 uninstall mccabe -y ; exit 0;
RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4

WORKDIR /home/PaddleSpeech/
RUN python setup.py bdist_wheel
RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple

CMD ["bash"]
...@@ -49,3 +49,4 @@ websockets
keyboard
uvicorn
pattern_singleton
braceexpand
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sample rate (Hz)
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
mean_phn_span: 8 # mean length of the masked phoneme span (MLM masking)
mlm_prob: 0.8 # probability of applying MLM masking
###########################################################
# DATA SETTING #
###########################################################
batch_size: 20
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model:
text_masking: false
postnet_layers: 5
postnet_filts: 5
postnet_chans: 256
encoder_type: conformer
decoder_type: conformer
enc_input_layer: sega_mlm
enc_pre_speech_layer: 0
enc_cnn_module_kernel: 7
enc_attention_dim: 384
enc_attention_heads: 2
enc_linear_units: 1536
enc_num_blocks: 4
enc_dropout_rate: 0.2
enc_positional_dropout_rate: 0.2
enc_attention_dropout_rate: 0.2
enc_normalize_before: true
enc_macaron_style: true
enc_use_cnn_module: true
enc_selfattention_layer_type: legacy_rel_selfattn
enc_activation_type: swish
enc_pos_enc_layer_type: legacy_rel_pos
enc_positionwise_layer_type: conv1d
enc_positionwise_conv_kernel_size: 3
dec_cnn_module_kernel: 31
dec_attention_dim: 384
dec_attention_heads: 2
dec_linear_units: 1536
dec_num_blocks: 4
dec_dropout_rate: 0.2
dec_positional_dropout_rate: 0.2
dec_attention_dropout_rate: 0.2
dec_macaron_style: true
dec_use_cnn_module: true
dec_selfattention_layer_type: legacy_rel_selfattn
dec_activation_type: swish
dec_pos_enc_layer_type: legacy_rel_pos
dec_positionwise_layer_type: conv1d
dec_positionwise_conv_kernel_size: 3
###########################################################
# OPTIMIZER SETTING #
###########################################################
scheduler_params:
d_model: 384
warmup_steps: 4000
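# The two parameters above typically drive a Noam-style warmup schedule
# (an assumption about this config; the scheduler type is not named here):
#   lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
# i.e. linear warmup for the first 4000 steps, then inverse-sqrt decay.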
grad_clip: 1.0
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 1500
num_snapshots: 50
###########################################################
# OTHER SETTING #
###########################################################
seed: 0
token_list:
- <blank>
- <unk>
- d
- sp
- sh
- ii
- j
- zh
- l
- x
- b
- g
- uu
- e5
- h
- q
- m
- i1
- t
- z
- ch
- f
- s
- u4
- ix4
- i4
- n
- i3
- iu3
- vv
- ian4
- ix2
- r
- e4
- ai4
- k
- ing2
- a1
- en2
- ui4
- ong1
- uo3
- u2
- u3
- ao4
- ee
- p
- an1
- eng2
- i2
- in1
- c
- ai2
- ian2
- e2
- an4
- ing4
- v4
- ai3
- a5
- ian3
- eng1
- ong4
- ang4
- ian1
- ing1
- iy4
- ao3
- ang1
- uo4
- u1
- iao4
- iu4
- a4
- van2
- ie4
- ang2
- ou4
- iang4
- ix1
- er4
- iy1
- e1
- en1
- ui2
- an3
- ei4
- ong2
- uo1
- ou3
- uo2
- iao1
- ou1
- an2
- uan4
- ia4
- ia1
- ang3
- v3
- iu2
- iao3
- in4
- a3
- ei3
- iang3
- v2
- eng4
- en3
- aa
- uan1
- v1
- ao1
- ve4
- ie3
- ai1
- ing3
- iang1
- a2
- ui1
- en4
- en5
- in3
- uan3
- e3
- ie1
- ve2
- ei2
- in2
- ix3
- uan2
- iang2
- ie2
- ua4
- ou2
- uai4
- er2
- eng3
- uang3
- un1
- ong3
- uang4
- vn4
- un2
- iy3
- iz4
- ui3
- iao2
- iong4
- un4
- van4
- ao2
- uang1
- iy5
- o2
- ei1
- ua1
- iu1
- uang2
- er5
- o1
- un3
- vn1
- vn2
- o4
- ve1
- van3
- ua2
- er3
- iong3
- van1
- ia2
- iy2
- ia3
- iong1
- uo5
- oo
- ve3
- ou5
- uai3
- ian5
- iong2
- uai2
- uai1
- ua3
- vn3
- ia5
- ie5
- ueng1
- o5
- o3
- iang5
- ei5
- <sos/eos>
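The comments in the feature-extraction block above encode simple sample-rate arithmetic (a 300-sample hop at 24 kHz is 12.5 ms), and `scheduler_params` with `d_model`/`warmup_steps` suggests a Noam-style warmup schedule. A minimal sketch to double-check both; the Noam formula is the standard Transformer one and is an assumption about how these two keys are consumed:
```python
# frame arithmetic from the config above
fs, n_shift, win_length = 24000, 300, 1200
print(n_shift / fs * 1000)     # 12.5 -> hop of 12.5 ms, matching the comment
print(win_length / fs * 1000)  # 50.0 -> window of 50 ms, matching the comment

d_model, warmup_steps = 384, 4000

def noam_lr(step: int) -> float:
    # assumed Noam/Transformer schedule implied by scheduler_params
    return d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)

print(noam_lr(warmup_steps))   # peak lr of ~8.1e-4 at the end of warmup
```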
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=aishell3 \
--rootdir=~/datasets/data_aishell3/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
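As the comment in the normalize stage says, dev and test are normalized with the training set's statistics so that all splits share one feature space. A rough sketch of the z-scoring this amounts to; the `[mean, scale]` row layout of `speech_stats.npy` is an assumption about what `compute_statistics.py` writes:
```python
import numpy as np

# assumed layout of speech_stats.npy: row 0 = per-dim mean, row 1 = per-dim scale (std)
stats = np.load("dump/train/speech_stats.npy")
mean, scale = stats[0], stats[1]

def normalize(feats: np.ndarray) -> np.ndarray:
    # the same training-set statistics are applied to train, dev and test,
    # so every split is z-scored into the same feature space
    return (feats - mean) / scale
```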
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=1
stop_stage=1
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=2 \
--phones-dict=dump/phone_id_map.txt
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=ernie_sat
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional args `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize; the vocoder (pwgan or hifigan) is chosen by stage/stop_stage in local/synthesize.sh
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -37,7 +37,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then ...@@ -37,7 +37,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--am_stat=dump/train/speech_stats.npy \ --am_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \ --voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pd \ --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \ --test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \ --output_dir=${train_output_path}/test \
......
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
mean_phn_span: 8
mlm_prob: 0.8
###########################################################
# DATA SETTING #
###########################################################
batch_size: 20
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model:
text_masking: true
postnet_layers: 5
postnet_filts: 5
postnet_chans: 256
encoder_type: conformer
decoder_type: conformer
enc_input_layer: sega_mlm
enc_pre_speech_layer: 0
enc_cnn_module_kernel: 7
enc_attention_dim: 384
enc_attention_heads: 2
enc_linear_units: 1536
enc_num_blocks: 4
enc_dropout_rate: 0.2
enc_positional_dropout_rate: 0.2
enc_attention_dropout_rate: 0.2
enc_normalize_before: true
enc_macaron_style: true
enc_use_cnn_module: true
enc_selfattention_layer_type: legacy_rel_selfattn
enc_activation_type: swish
enc_pos_enc_layer_type: legacy_rel_pos
enc_positionwise_layer_type: conv1d
enc_positionwise_conv_kernel_size: 3
dec_cnn_module_kernel: 31
dec_attention_dim: 384
dec_attention_heads: 2
dec_linear_units: 1536
dec_num_blocks: 4
dec_dropout_rate: 0.2
dec_positional_dropout_rate: 0.2
dec_attention_dropout_rate: 0.2
dec_macaron_style: true
dec_use_cnn_module: true
dec_selfattention_layer_type: legacy_rel_selfattn
dec_activation_type: swish
dec_pos_enc_layer_type: legacy_rel_pos
dec_positionwise_layer_type: conv1d
dec_positionwise_conv_kernel_size: 3
###########################################################
# OPTIMIZER SETTING #
###########################################################
scheduler_params:
d_model: 384
warmup_steps: 4000
grad_clip: 1.0
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 700
num_snapshots: 50
###########################################################
# OTHER SETTING #
###########################################################
seed: 0
token_list:
- <blank>
- <unk>
- AH0
- T
- N
- sp
- S
- R
- D
- L
- Z
- DH
- IH1
- K
- W
- M
- EH1
- AE1
- ER0
- B
- IY1
- P
- V
- IY0
- F
- HH
- AA1
- AY1
- AH1
- EY1
- IH0
- AO1
- OW1
- UW1
- G
- NG
- SH
- Y
- TH
- ER1
- JH
- UH1
- AW1
- CH
- IH2
- OW0
- OW2
- EY2
- EH2
- UW0
- OY1
- ZH
- EH0
- AY2
- AW2
- AA2
- AE2
- IY2
- AH2
- AE0
- AO2
- AY0
- AO0
- UW2
- UH2
- AA0
- EY0
- AW0
- UH0
- ER2
- OY2
- OY0
- d
- sh
- ii
- j
- zh
- l
- x
- b
- g
- uu
- e5
- h
- q
- m
- i1
- t
- z
- ch
- f
- s
- u4
- ix4
- i4
- n
- i3
- iu3
- vv
- ian4
- ix2
- r
- e4
- ai4
- k
- ing2
- a1
- en2
- ui4
- ong1
- uo3
- u2
- u3
- ao4
- ee
- p
- an1
- eng2
- i2
- in1
- c
- ai2
- ian2
- e2
- an4
- ing4
- v4
- ai3
- a5
- ian3
- eng1
- ong4
- ang4
- ian1
- ing1
- iy4
- ao3
- ang1
- uo4
- u1
- iao4
- iu4
- a4
- van2
- ie4
- ang2
- ou4
- iang4
- ix1
- er4
- iy1
- e1
- en1
- ui2
- an3
- ei4
- ong2
- uo1
- ou3
- uo2
- iao1
- ou1
- an2
- uan4
- ia4
- ia1
- ang3
- v3
- iu2
- iao3
- in4
- a3
- ei3
- iang3
- v2
- eng4
- en3
- aa
- uan1
- v1
- ao1
- ve4
- ie3
- ai1
- ing3
- iang1
- a2
- ui1
- en4
- en5
- in3
- uan3
- e3
- ie1
- ve2
- ei2
- in2
- ix3
- uan2
- iang2
- ie2
- ua4
- ou2
- uai4
- er2
- eng3
- uang3
- un1
- ong3
- uang4
- vn4
- un2
- iy3
- iz4
- ui3
- iao2
- iong4
- un4
- van4
- ao2
- uang1
- iy5
- o2
- ei1
- ua1
- iu1
- uang2
- er5
- o1
- un3
- vn1
- vn2
- o4
- ve1
- van3
- ua2
- er3
- iong3
- van1
- ia2
- iy2
- ia3
- iong1
- uo5
- oo
- ve3
- ou5
- uai3
- ian5
- iong2
- uai2
- uai1
- ua3
- vn3
- ia5
- ie5
- ueng1
- o5
- o3
- iang5
- ei5
- <sos/eos>
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results for aishell3 ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./aishell3_alignment_tone \
--output durations_aishell3.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results for vctk ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output durations_vctk.txt \
--config=${config_path}
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # concatenate the aishell3 and vctk durations
echo "concat durations_aishell3.txt and durations_vctk.txt to durations.txt"
cat durations_aishell3.txt durations_vctk.txt > durations.txt
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=aishell3 \
--rootdir=~/datasets/data_aishell3/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=vctk \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    # get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    # normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=1
stop_stage=1
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=pwgan_aishell3 \
--voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=2 \
--phones-dict=dump/phone_id_map.txt
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=ernie_sat
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional args `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize; the vocoder (pwgan or hifigan) is chosen by stage/stop_stage in local/synthesize.sh
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -29,7 +29,7 @@ generator_params: ...@@ -29,7 +29,7 @@ generator_params:
out_channels: 4 # Number of output channels. out_channels: 4 # Number of output channels.
kernel_size: 7 # Kernel size of initial and final conv layers. kernel_size: 7 # Kernel size of initial and final conv layers.
channels: 384 # Initial number of channels for conv layers. channels: 384 # Initial number of channels for conv layers.
upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) == n_shift upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift
stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack. stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
stacks: 4 # Number of stacks in a single residual stack module. stacks: 4 # Number of stacks in a single residual stack module.
use_weight_norm: True # Whether to use weight normalization. use_weight_norm: True # Whether to use weight normalization.
......
...@@ -29,7 +29,7 @@ generator_params: ...@@ -29,7 +29,7 @@ generator_params:
out_channels: 4 # Number of output channels. out_channels: 4 # Number of output channels.
kernel_size: 7 # Kernel size of initial and final conv layers. kernel_size: 7 # Kernel size of initial and final conv layers.
channels: 384 # Initial number of channels for conv layers. channels: 384 # Initial number of channels for conv layers.
upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) == n_shift upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift
stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack. stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
stacks: 4 # Number of stacks in a single residual stack module. stacks: 4 # Number of stacks in a single residual stack module.
use_weight_norm: True # Whether to use weight normalization. use_weight_norm: True # Whether to use weight normalization.
......
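The corrected comment in both hunks above is easy to verify numerically: the generator upsamples by `prod(upsample_scales)` frames and the `out_channels` sub-bands are then interleaved back to full rate, so their product has to equal the hop size from feature extraction. A quick sanity check with the configured values:
```python
import numpy as np

upsample_scales = [5, 5, 3]  # from generator_params
out_channels = 4             # number of sub-bands
n_shift = 300                # hop size (samples) in feature extraction

# 5 * 5 * 3 = 75 frames -> 75 * 4 = 300 samples per frame, i.e. the hop size
assert int(np.prod(upsample_scales)) * out_channels == n_shift
```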
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Usage: """ Usage:
align.py wavfile trsfile outwordfile outphonefile align.py wavfile trsfile outwordfile outphonefile
""" """
......
#!/usr/bin/env python3 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os import os
import random import random
from typing import Dict from typing import Dict
...@@ -305,7 +317,6 @@ def get_dur_adj_factor(orig_dur: List[int], ...@@ -305,7 +317,6 @@ def get_dur_adj_factor(orig_dur: List[int],
def prep_feats_with_dur(wav_path: str, def prep_feats_with_dur(wav_path: str,
mlm_model: nn.Layer,
source_lang: str="English", source_lang: str="English",
target_lang: str="English", target_lang: str="English",
old_str: str="", old_str: str="",
...@@ -425,8 +436,7 @@ def prep_feats_with_dur(wav_path: str, ...@@ -425,8 +436,7 @@ def prep_feats_with_dur(wav_path: str,
return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy
def prep_feats(mlm_model: nn.Layer, def prep_feats(wav_path: str,
wav_path: str,
source_lang: str="english", source_lang: str="english",
target_lang: str="english", target_lang: str="english",
old_str: str="", old_str: str="",
...@@ -440,7 +450,6 @@ def prep_feats(mlm_model: nn.Layer, ...@@ -440,7 +450,6 @@ def prep_feats(mlm_model: nn.Layer,
wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur( wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur(
source_lang=source_lang, source_lang=source_lang,
target_lang=target_lang, target_lang=target_lang,
mlm_model=mlm_model,
old_str=old_str, old_str=old_str,
new_str=new_str, new_str=new_str,
wav_path=wav_path, wav_path=wav_path,
...@@ -482,7 +491,6 @@ def decode_with_model(mlm_model: nn.Layer, ...@@ -482,7 +491,6 @@ def decode_with_model(mlm_model: nn.Layer,
batch, old_span_bdy, new_span_bdy = prep_feats( batch, old_span_bdy, new_span_bdy = prep_feats(
source_lang=source_lang, source_lang=source_lang,
target_lang=target_lang, target_lang=target_lang,
mlm_model=mlm_model,
wav_path=wav_path, wav_path=wav_path,
old_str=old_str, old_str=old_str,
new_str=new_str, new_str=new_str,
......
This diff has been collapsed.
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse import argparse
......
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from pathlib import Path from pathlib import Path
from typing import Dict from typing import Dict
from typing import List from typing import List
......
#!/bin/bash
set -e
source path.sh
# en --> zh speech synthesis
# use Prompt_003_new as the prompt speech ("This was not the show for me.") to synthesize: '今天天气很好'
# NOTE: the input new_str must consist of Chinese characters; otherwise preprocessing keeps only the Chinese characters, i.e. it synthesizes the preprocessed Chinese text.
python local/inference_new.py \
--task_name=cross-lingual_clone \
--model_name=paddle_checkpoint_dual_mask_enzh \
--uid=Prompt_003_new \
--new_str='今天天气很好.' \
--prefix='./prompt/dev/' \
--source_lang=english \
--target_lang=chinese \
--output_name=pred_clone.wav \
--voc=pwgan_aishell3 \
--voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
--am=fastspeech2_csmsc \
--am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \
--am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \
--am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \
--phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt
#!/bin/bash
set -e
source path.sh
# English-only speech synthesis
# the example uses the speech of p299_096 as the prompt ("This was not the show for me.") to synthesize: 'I enjoy my life.'
python local/inference_new.py \
--task_name=synthesize \
--model_name=paddle_checkpoint_en \
--uid=p299_096 \
--new_str='I enjoy my life, do you?' \
--prefix='./prompt/dev/' \
--source_lang=english \
--target_lang=english \
--output_name=pred_gen.wav \
--voc=pwgan_aishell3 \
--voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
--am=fastspeech2_ljspeech \
--am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
--am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
--am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
--phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
#!/bin/bash
set -e
source path.sh
# English-only speech editing
# the example edits the original speech of p243_new ("For that reason cover should not be given.") into speech for 'for that reason cover is impossible to be given.'
# NOTE: the speech editing task currently supports replacing or inserting text at only 1 position in a sentence
python local/inference_new.py \
--task_name=edit \
--model_name=paddle_checkpoint_en \
--uid=p243_new \
--new_str='for that reason cover is impossible to be given.' \
--prefix='./prompt/dev/' \
--source_lang=english \
--target_lang=english \
--output_name=pred_edit.wav \
--voc=pwgan_aishell3 \
--voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \
--voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
--voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \
--am=fastspeech2_ljspeech \
--am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
--am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
--am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
--phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
#!/bin/bash
rm -rf *.wav
./run_sedit_en_new.sh        # speech editing task (English)
./run_gen_en_new.sh          # personalized speech synthesis task (English)
./run_clone_en_to_zh_new.sh  # cross-lingual speech synthesis task (English-to-Chinese voice cloning)
...@@ -29,7 +29,7 @@ optimizer_params: ...@@ -29,7 +29,7 @@ optimizer_params:
scheduler_params: scheduler_params:
learning_rate: 1.0e-5 # learning rate. learning_rate: 1.0e-5 # learning rate.
gamma: 1.0 # scheduler gamma. gamma: 0.9999 # scheduler gamma; must be between (0.0, 1.0), and closer to 1.0 is better.
########################################################### ###########################################################
# TRAINING SETTING # # TRAINING SETTING #
......
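The corrected comment reads naturally once the scheduler is taken to be exponential decay, i.e. the learning rate after `t` steps is `lr0 * gamma**t` (an assumption about the scheduler class behind these keys): `gamma = 1.0` never decays at all, while `0.9999` halves the rate roughly every 6,931 steps. A tiny sketch:
```python
lr0, gamma = 1.0e-5, 0.9999

# exponential decay: lr after t steps is lr0 * gamma**t;
# gamma = 1.0 would print 1e-05 forever, 0.9999 decays slowly and smoothly
for t in (0, 1000, 10000, 100000):
    print(t, lr0 * gamma**t)
```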
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
mean_phn_span: 8
mlm_prob: 0.8
###########################################################
# DATA SETTING #
###########################################################
batch_size: 20
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model:
text_masking: false
postnet_layers: 5
postnet_filts: 5
postnet_chans: 256
encoder_type: conformer
decoder_type: conformer
enc_input_layer: sega_mlm
enc_pre_speech_layer: 0
enc_cnn_module_kernel: 7
enc_attention_dim: 384
enc_attention_heads: 2
enc_linear_units: 1536
enc_num_blocks: 4
enc_dropout_rate: 0.2
enc_positional_dropout_rate: 0.2
enc_attention_dropout_rate: 0.2
enc_normalize_before: true
enc_macaron_style: true
enc_use_cnn_module: true
enc_selfattention_layer_type: legacy_rel_selfattn
enc_activation_type: swish
enc_pos_enc_layer_type: legacy_rel_pos
enc_positionwise_layer_type: conv1d
enc_positionwise_conv_kernel_size: 3
dec_cnn_module_kernel: 31
dec_attention_dim: 384
dec_attention_heads: 2
dec_linear_units: 1536
dec_num_blocks: 4
dec_dropout_rate: 0.2
dec_positional_dropout_rate: 0.2
dec_attention_dropout_rate: 0.2
dec_macaron_style: true
dec_use_cnn_module: true
dec_selfattention_layer_type: legacy_rel_selfattn
dec_activation_type: swish
dec_pos_enc_layer_type: legacy_rel_pos
dec_positionwise_layer_type: conv1d
dec_positionwise_conv_kernel_size: 3
###########################################################
# OPTIMIZER SETTING #
###########################################################
scheduler_params:
d_model: 384
warmup_steps: 4000
grad_clip: 1.0
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 1500
num_snapshots: 50
###########################################################
# OTHER SETTING #
###########################################################
seed: 0
token_list:
- <blank>
- <unk>
- AH0
- T
- N
- sp
- D
- S
- R
- L
- IH1
- DH
- AE1
- M
- EH1
- K
- Z
- W
- HH
- ER0
- AH1
- IY1
- P
- V
- F
- B
- AY1
- IY0
- EY1
- AA1
- AO1
- UW1
- IH0
- OW1
- NG
- G
- SH
- ER1
- Y
- TH
- AW1
- CH
- UH1
- IH2
- JH
- OW0
- EH2
- OY1
- AY2
- EH0
- EY2
- UW0
- AE2
- AA2
- OW2
- AH2
- ZH
- AO2
- IY2
- AE0
- UW2
- AY0
- AA0
- AO0
- AW2
- EY0
- UH2
- ER2
- OY2
- UH0
- AW0
- OY0
- <sos/eos>
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./vctk_alignment \
--output durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=vctk \
--rootdir=~/datasets/VCTK-Corpus-0.92/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize and convert phone/speaker to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=1
stop_stage=1
# use am to predict duration here
# TODO: add am_phones_dict, am_tones_dict, etc.; the am could also be built the new way, which would not need this many parameters
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=pwgan_vctk \
--voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \
--erniesat_config=${config_path} \
--erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--erniesat_stat=dump/train/speech_stats.npy \
--voc=hifigan_vctk \
--voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
--voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
#!/bin/bash
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=2 \
--phones-dict=dump/phone_id_map.txt
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=ernie_sat
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_153.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional args `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize; the vocoder (pwgan or hifigan) is chosen by stage/stop_stage in local/synthesize.sh
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -24,7 +24,7 @@ f0max: 400 # Maximum f0 for pitch extraction. ...@@ -24,7 +24,7 @@ f0max: 400 # Maximum f0 for pitch extraction.
# DATA SETTING # # DATA SETTING #
########################################################### ###########################################################
batch_size: 64 batch_size: 64
num_workers: 4 num_workers: 2
########################################################### ###########################################################
......
# Test
We trained a Chinese-English mixed FastSpeech2 model. The training code is still being sorted out; for now, this section shows how to use the released model.
The sample rate of the synthesized audio is 22050 Hz.
## Download pretrained models
Put pretrained models in a directory named `models`.
- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
```bash
mkdir models
cd models
wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
unzip fastspeech2_csmscljspeech_add-zhen.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
unzip hifigan_ljspeech_ckpt_0.2.0.zip
cd ../
```
## Test
You can choose `--spk_id` from {0, 1} in `local/synthesize_e2e.sh`.
```bash
bash test.sh
```
#!/bin/bash
model_dir=$1
output=$2
am_name=fastspeech2_csmscljspeech_add-zhen
am_model_dir=${model_dir}/${am_name}/
stage=1
stop_stage=1
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_mix \
--am_config=${am_model_dir}/default.yaml \
--am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \
--am_stat=${am_model_dir}/speech_stats.npy \
--voc=hifigan_ljspeech \
--voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \
--voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
--lang=mix \
--text=${BIN_DIR}/../sentences_mix.txt \
--output_dir=${output}/test_e2e \
--phones_dict=${am_model_dir}/phone_id_map.txt \
--speaker_dict=${am_model_dir}/speaker_id_map.txt \
--spk_id 0
fi
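The script feeds `sentences_mix.txt` to `synthesize_e2e.py`. A hedged sketch of the input format, assuming the usual one-utterance-per-line `<utt_id> <text>` layout of the other `sentences*.txt` files (the example lines are illustrative, not from the repo):
```python
# write a minimal sentences_mix.txt: "<utt_id> <text>" per line (assumed format)
lines = [
    "001 我们的声学模型使用了 FastSpeech2, 声码器使用了 HiFiGAN.",
    "002 PaddleSpeech supports mixed Chinese and English sentences.",
]
with open("sentences_mix.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```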
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=fastspeech2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=3
stop_stage=100
model_dir=models
output_dir=output
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional args `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is hifigan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1
fi
...@@ -29,8 +29,7 @@ from yacs.config import CfgNode ...@@ -29,8 +29,7 @@ from yacs.config import CfgNode
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import stats_wrapper from ..utils import stats_wrapper
from paddlespeech.t2s.frontend import English from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore from paddlespeech.t2s.modules.normalizer import ZScore
__all__ = ['TTSExecutor'] __all__ = ['TTSExecutor']
...@@ -54,6 +53,7 @@ class TTSExecutor(BaseExecutor): ...@@ -54,6 +53,7 @@ class TTSExecutor(BaseExecutor):
'fastspeech2_ljspeech', 'fastspeech2_ljspeech',
'fastspeech2_aishell3', 'fastspeech2_aishell3',
'fastspeech2_vctk', 'fastspeech2_vctk',
'fastspeech2_mix',
'tacotron2_csmsc', 'tacotron2_csmsc',
'tacotron2_ljspeech', 'tacotron2_ljspeech',
], ],
...@@ -98,7 +98,7 @@ class TTSExecutor(BaseExecutor): ...@@ -98,7 +98,7 @@ class TTSExecutor(BaseExecutor):
self.parser.add_argument( self.parser.add_argument(
'--voc', '--voc',
type=str, type=str,
default='pwgan_csmsc', default='hifigan_csmsc',
choices=[ choices=[
'pwgan_csmsc', 'pwgan_csmsc',
'pwgan_ljspeech', 'pwgan_ljspeech',
...@@ -135,7 +135,7 @@ class TTSExecutor(BaseExecutor): ...@@ -135,7 +135,7 @@ class TTSExecutor(BaseExecutor):
'--lang', '--lang',
type=str, type=str,
default='zh', default='zh',
help='Choose model language. zh or en') help='Choose model language. zh or en or mix')
self.parser.add_argument( self.parser.add_argument(
'--device', '--device',
type=str, type=str,
...@@ -231,8 +231,11 @@ class TTSExecutor(BaseExecutor): ...@@ -231,8 +231,11 @@ class TTSExecutor(BaseExecutor):
use_pretrained_voc = True use_pretrained_voc = True
else: else:
use_pretrained_voc = False use_pretrained_voc = False
voc_lang = lang
voc_tag = voc + '-' + lang # we must use ljspeech's voc for mix am now!
if lang == 'mix':
voc_lang = 'en'
voc_tag = voc + '-' + voc_lang
self.task_resource.set_task_model( self.task_resource.set_task_model(
model_tag=voc_tag, model_tag=voc_tag,
model_type=1, # vocoder model_type=1, # vocoder
...@@ -281,13 +284,8 @@ class TTSExecutor(BaseExecutor): ...@@ -281,13 +284,8 @@ class TTSExecutor(BaseExecutor):
spk_num = len(spk_id) spk_num = len(spk_id)
# frontend # frontend
if lang == 'zh': self.frontend = get_frontend(
self.frontend = Frontend( lang=lang, phones_dict=self.phones_dict, tones_dict=self.tones_dict)
phone_vocab_path=self.phones_dict,
tone_vocab_path=self.tones_dict)
elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict)
# acoustic model # acoustic model
odim = self.am_config.n_mels odim = self.am_config.n_mels
...@@ -381,8 +379,12 @@ class TTSExecutor(BaseExecutor): ...@@ -381,8 +379,12 @@ class TTSExecutor(BaseExecutor):
input_ids = self.frontend.get_input_ids( input_ids = self.frontend.get_input_ids(
text, merge_sentences=merge_sentences) text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
elif lang == 'mix':
input_ids = self.frontend.get_input_ids(
text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else: else:
logger.error("lang should in {'zh', 'en'}!") logger.error("lang should in {'zh', 'en', 'mix'}!")
self.frontend_time = time.time() - frontend_st self.frontend_time = time.time() - frontend_st
self.am_time = 0 self.am_time = 0
...@@ -398,7 +400,7 @@ class TTSExecutor(BaseExecutor): ...@@ -398,7 +400,7 @@ class TTSExecutor(BaseExecutor):
# fastspeech2 # fastspeech2
else: else:
# multi speaker # multi speaker
if am_dataset in {"aishell3", "vctk"}: if am_dataset in {'aishell3', 'vctk', 'mix'}:
mel = self.am_inference( mel = self.am_inference(
part_phone_ids, spk_id=paddle.to_tensor(spk_id)) part_phone_ids, spk_id=paddle.to_tensor(spk_id))
else: else:
......
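With `fastspeech2_mix` and `lang='mix'` wired into the executor above, a mixed-language sentence can be synthesized directly from Python. A sketch assuming the documented `TTSExecutor` call signature; the example text is illustrative, and note the diff forces an ljspeech (English) vocoder whenever `lang == 'mix'`:
```python
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(
    text="热烈欢迎您在 Discussions 中提交问题, happy to help!",
    am="fastspeech2_mix",    # the mixed zh/en acoustic model added above
    voc="hifigan_ljspeech",  # for lang='mix' the voc falls back to ljspeech (en)
    lang="mix",
    spk_id=0,                # speaker id from speaker_id_map.txt
    output="mix.wav",
)
```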
...@@ -655,6 +655,24 @@ tts_dynamic_pretrained_models = { ...@@ -655,6 +655,24 @@ tts_dynamic_pretrained_models = {
'phone_id_map.txt', 'phone_id_map.txt',
}, },
}, },
"fastspeech2_mix-mix": {
'1.0': {
'url':
'https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip',
'md5':
'77d9d4b5a79ed6203339ead7ef6c74f9',
'config':
'default.yaml',
'ckpt':
'snapshot_iter_94000.pdz',
'speech_stats':
'speech_stats.npy',
'phones_dict':
'phone_id_map.txt',
'speaker_dict':
'speaker_id_map.txt',
},
},
# tacotron2 # tacotron2
"tacotron2_csmsc-zh": { "tacotron2_csmsc-zh": {
'1.0': { '1.0': {
......
...@@ -630,12 +630,10 @@ class U2BaseModel(ASRInterface, nn.Layer): ...@@ -630,12 +630,10 @@ class U2BaseModel(ASRInterface, nn.Layer):
(elayers, head, cache_t1, d_k * 2), where (elayers, head, cache_t1, d_k * 2), where
`head * d_k == hidden-dim` and `head * d_k == hidden-dim` and
`cache_t1 == chunk_size * num_decoding_left_chunks`. `cache_t1 == chunk_size * num_decoding_left_chunks`.
`d_k * 2` for att key & value. Default is 0-dims Tensor, `d_k * 2` for att key & value.
it is used for dy2st.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer, cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, b=1, hidden-dim, cache_t2), where (elayers, b=1, hidden-dim, cache_t2), where
`cache_t2 == cnn.lorder - 1`. Default is 0-dims Tensor, `cache_t2 == cnn.lorder - 1`.
it is used for dy2st.
Returns: Returns:
paddle.Tensor: output of current input xs, paddle.Tensor: output of current input xs,
......
...@@ -76,9 +76,9 @@ class TransformerEncoderLayer(nn.Layer): ...@@ -76,9 +76,9 @@ class TransformerEncoderLayer(nn.Layer):
x: paddle.Tensor, x: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
pos_emb: paddle.Tensor, pos_emb: paddle.Tensor,
mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool), mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features. """Compute encoded features.
Args: Args:
...@@ -105,9 +105,7 @@ class TransformerEncoderLayer(nn.Layer): ...@@ -105,9 +105,7 @@ class TransformerEncoderLayer(nn.Layer):
if self.normalize_before: if self.normalize_before:
x = self.norm1(x) x = self.norm1(x)
x_att, new_att_cache = self.self_attn( x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
x, x, x, mask, cache=att_cache
)
if self.concat_after: if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1) x_concat = paddle.concat((x, x_att), axis=-1)
...@@ -124,7 +122,7 @@ class TransformerEncoderLayer(nn.Layer): ...@@ -124,7 +122,7 @@ class TransformerEncoderLayer(nn.Layer):
if not self.normalize_before: if not self.normalize_before:
x = self.norm2(x) x = self.norm2(x)
fake_cnn_cache = paddle.zeros([0,0,0], dtype=x.dtype) fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
return x, mask, new_att_cache, fake_cnn_cache return x, mask, new_att_cache, fake_cnn_cache
...@@ -195,9 +193,9 @@ class ConformerEncoderLayer(nn.Layer): ...@@ -195,9 +193,9 @@ class ConformerEncoderLayer(nn.Layer):
x: paddle.Tensor, x: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
pos_emb: paddle.Tensor, pos_emb: paddle.Tensor,
mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool), mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0,0,0,0]), cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features. """Compute encoded features.
Args: Args:
...@@ -211,7 +209,8 @@ class ConformerEncoderLayer(nn.Layer): ...@@ -211,7 +209,8 @@ class ConformerEncoderLayer(nn.Layer):
att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
(#batch=1, head, cache_t1, d_k * 2), head * d_k == size. (#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer cnn_cache (paddle.Tensor): Convolution cache in conformer layer
(#batch=1, size, cache_t2) (1, #batch=1, size, cache_t2). First dim will not be used, just
for dy2st.
Returns: Returns:
paddle.Tensor: Output tensor (#batch, time, size). paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time, time). paddle.Tensor: Mask tensor (#batch, time, time).
...@@ -219,6 +218,8 @@ class ConformerEncoderLayer(nn.Layer): ...@@ -219,6 +218,8 @@ class ConformerEncoderLayer(nn.Layer):
(#batch=1, head, cache_t1 + time, d_k * 2). (#batch=1, head, cache_t1 + time, d_k * 2).
paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2). paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2).
""" """
# (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
# whether to use macaron style FFN # whether to use macaron style FFN
if self.feed_forward_macaron is not None: if self.feed_forward_macaron is not None:
...@@ -249,8 +250,7 @@ class ConformerEncoderLayer(nn.Layer): ...@@ -249,8 +250,7 @@ class ConformerEncoderLayer(nn.Layer):
# convolution module # convolution module
# Fake new cnn cache here, and then change it in conv_module # Fake new cnn cache here, and then change it in conv_module
new_cnn_cache = paddle.zeros([0,0,0], dtype=x.dtype) new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
if self.conv_module is not None: if self.conv_module is not None:
residual = x residual = x
if self.normalize_before: if self.normalize_before:
......
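The squeeze that this hunk moves to the top of `forward` just drops the dummy leading dimension, which exists only so dynamic-to-static export sees a fixed-rank cache tensor. A tiny check of that reshaping:
```python
import paddle

# (1, batch=1, size, cache_t2): the leading dim exists only for dy2st export
cnn_cache = paddle.zeros([1, 1, 256, 30])
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
print(cnn_cache.shape)  # [1, 256, 30] -> (batch=1, size, cache_t2)
```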
...@@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool: ...@@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
else: else:
st = time.time() st = time.time()
connection_handler.infer(text=sentence) connection_handler.infer(
text=sentence,
lang=tts_engine.lang,
am=tts_engine.config.am)
et = time.time() et = time.time()
logger.debug( logger.debug(
f"The response time of the {i} warm up: {et - st} s") f"The response time of the {i} warm up: {et - st} s")
......
...@@ -28,6 +28,150 @@ from paddlespeech.t2s.modules.nets_utils import phones_masking ...@@ -28,6 +28,150 @@ from paddlespeech.t2s.modules.nets_utils import phones_masking
from paddlespeech.t2s.modules.nets_utils import phones_text_masking from paddlespeech.t2s.modules.nets_utils import phones_text_masking
# an extra builder is needed because the masking parameters have to be passed in
def build_erniesat_collate_fn(mlm_prob: float=0.8,
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False):
return ErnieSATCollateFn(
mlm_prob=mlm_prob,
mean_phn_span=mean_phn_span,
seg_emb=seg_emb,
text_masking=text_masking)
class ErnieSATCollateFn:
"""Functor class of common_collate_fn()"""
def __init__(self,
mlm_prob: float=0.8,
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False):
self.mlm_prob = mlm_prob
self.mean_phn_span = mean_phn_span
self.seg_emb = seg_emb
self.text_masking = text_masking
    def __call__(self, examples):
        return erniesat_batch_fn(
            examples,
mlm_prob=self.mlm_prob,
mean_phn_span=self.mean_phn_span,
seg_emb=self.seg_emb,
text_masking=self.text_masking)
def erniesat_batch_fn(examples,
mlm_prob: float=0.8,
mean_phn_span: int=8,
seg_emb: bool=False,
text_masking: bool=False):
# fields = ["text", "text_lengths", "speech", "speech_lengths", "align_start", "align_end"]
text = [np.array(item["text"], dtype=np.int64) for item in examples]
speech = [np.array(item["speech"], dtype=np.float32) for item in examples]
text_lengths = [
np.array(item["text_lengths"], dtype=np.int64) for item in examples
]
speech_lengths = [
np.array(item["speech_lengths"], dtype=np.int64) for item in examples
]
align_start = [
np.array(item["align_start"], dtype=np.int64) for item in examples
]
align_end = [
np.array(item["align_end"], dtype=np.int64) for item in examples
]
align_start_lengths = [
np.array(len(item["align_start"]), dtype=np.int64) for item in examples
]
# add_pad
text = batch_sequences(text)
speech = batch_sequences(speech)
align_start = batch_sequences(align_start)
align_end = batch_sequences(align_end)
# convert each batch to paddle.Tensor
text = paddle.to_tensor(text)
speech = paddle.to_tensor(speech)
text_lengths = paddle.to_tensor(text_lengths)
speech_lengths = paddle.to_tensor(speech_lengths)
align_start_lengths = paddle.to_tensor(align_start_lengths)
speech_pad = speech
text_pad = text
text_mask = make_non_pad_mask(
text_lengths, text_pad, length_dim=1).unsqueeze(-2)
speech_mask = make_non_pad_mask(
speech_lengths, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2)
# for training
span_bdy = None
# for inference
if 'span_bdy' in examples[0].keys():
span_bdy = [
np.array(item["span_bdy"], dtype=np.int64) for item in examples
]
span_bdy = paddle.to_tensor(span_bdy)
    # dual_mask: for mixed Chinese-English data, speech and text are masked at the same time
    # ERNIE-SAT masks both when doing the cross-lingual task
if text_masking:
masked_pos, text_masked_pos = phones_text_masking(
xs_pad=speech_pad,
src_mask=speech_mask,
text_pad=text_pad,
text_mask=text_mask,
align_start=align_start,
align_end=align_end,
align_start_lens=align_start_lengths,
mlm_prob=mlm_prob,
mean_phn_span=mean_phn_span,
span_bdy=span_bdy)
    # for training pure-Chinese and pure-English models -> A3T does not mask phonemes, it only masks speech
    # the main difference between A3T and ERNIE-SAT lies in how the masking is done
else:
masked_pos = phones_masking(
xs_pad=speech_pad,
src_mask=speech_mask,
align_start=align_start,
align_end=align_end,
align_start_lens=align_start_lengths,
mlm_prob=mlm_prob,
mean_phn_span=mean_phn_span,
span_bdy=span_bdy)
text_masked_pos = paddle.zeros(paddle.shape(text_pad))
speech_seg_pos, text_seg_pos = get_seg_pos(
speech_pad=speech_pad,
text_pad=text_pad,
align_start=align_start,
align_end=align_end,
align_start_lens=align_start_lengths,
seg_emb=seg_emb)
batch = {
"text": text,
"speech": speech,
# need to generate
"masked_pos": masked_pos,
"speech_mask": speech_mask,
"text_mask": text_mask,
"speech_seg_pos": speech_seg_pos,
"text_seg_pos": text_seg_pos,
"text_masked_pos": text_masked_pos
}
return batch
def tacotron2_single_spk_batch_fn(examples): def tacotron2_single_spk_batch_fn(examples):
# fields = ["text", "text_lengths", "speech", "speech_lengths"] # fields = ["text", "text_lengths", "speech", "speech_lengths"]
text = [np.array(item["text"], dtype=np.int64) for item in examples] text = [np.array(item["text"], dtype=np.int64) for item in examples]
...@@ -378,7 +522,6 @@ class MLMCollateFn: ...@@ -378,7 +522,6 @@ class MLMCollateFn:
mean_phn_span=self.mean_phn_span, mean_phn_span=self.mean_phn_span,
seg_emb=self.seg_emb, seg_emb=self.seg_emb,
text_masking=self.text_masking, text_masking=self.text_masking,
attention_window=self.attention_window,
not_sequence=self.not_sequence) not_sequence=self.not_sequence)
...@@ -389,7 +532,6 @@ def mlm_collate_fn( ...@@ -389,7 +532,6 @@ def mlm_collate_fn(
mean_phn_span: int=8, mean_phn_span: int=8,
seg_emb: bool=False, seg_emb: bool=False,
text_masking: bool=False, text_masking: bool=False,
attention_window: int=0,
pad_value: int=0, pad_value: int=0,
not_sequence: Collection[str]=(), not_sequence: Collection[str]=(),
) -> Tuple[List[str], Dict[str, paddle.Tensor]]: ) -> Tuple[List[str], Dict[str, paddle.Tensor]]:
...@@ -420,6 +562,7 @@ def mlm_collate_fn( ...@@ -420,6 +562,7 @@ def mlm_collate_fn(
feats = feats_extract.get_log_mel_fbank(np.array(output["speech"][0])) feats = feats_extract.get_log_mel_fbank(np.array(output["speech"][0]))
feats = paddle.to_tensor(feats) feats = paddle.to_tensor(feats)
print("feats.shape:", feats.shape)
feats_lens = paddle.shape(feats)[0] feats_lens = paddle.shape(feats)[0]
feats = paddle.unsqueeze(feats, 0) feats = paddle.unsqueeze(feats, 0)
...@@ -439,6 +582,7 @@ def mlm_collate_fn( ...@@ -439,6 +582,7 @@ def mlm_collate_fn(
text_lens, text_pad, length_dim=1).unsqueeze(-2) text_lens, text_pad, length_dim=1).unsqueeze(-2)
speech_mask = make_non_pad_mask( speech_mask = make_non_pad_mask(
feats_lens, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2) feats_lens, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2)
span_bdy = None span_bdy = None
if 'span_bdy' in output.keys(): if 'span_bdy' in output.keys():
span_bdy = output['span_bdy'] span_bdy = output['span_bdy']
......
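Since `ErnieSATCollateFn` exists only to close over the masking hyper-parameters (hence the comment about needing an extra builder), the intended wiring is presumably to build it once from the config and hand it to a `DataLoader`. A usage sketch under that assumption; `train_dataset` is hypothetical:
```python
from paddle.io import DataLoader

collate_fn = build_erniesat_collate_fn(
    mlm_prob=0.8,        # values taken from the config above
    mean_phn_span=8,
    seg_emb=False,
    text_masking=False)  # True only for the cross-lingual (dual-mask) setup

# train_dataset is assumed to yield dicts with the fields named in erniesat_batch_fn
# loader = DataLoader(train_dataset, batch_size=20, collate_fn=collate_fn)
```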
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import shutil
from pathlib import Path
import librosa
import numpy as np
import pypinyin
from praatio import textgrid
from paddlespeech.t2s.exps.ernie_sat.utils import get_tmp_name
from paddlespeech.t2s.exps.ernie_sat.utils import get_dict
DICT_EN = 'tools/aligner/cmudict-0.7b'
DICT_ZH = 'tools/aligner/simple.lexicon'
MODEL_DIR_EN = 'tools/aligner/vctk_model.zip'
MODEL_DIR_ZH = 'tools/aligner/aishell3_model.zip'
MFA_PATH = 'tools/montreal-forced-aligner/bin'
os.environ['PATH'] = MFA_PATH + '/:' + os.environ['PATH']
def _get_max_idx(dic):
return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1]
def _readtg(tg_path: str, lang: str='en', fs: int=24000, n_shift: int=300):
alignment = textgrid.openTextgrid(tg_path, includeEmptyIntervals=True)
phones = []
ends = []
words = []
for interval in alignment.tierDict['words'].entryList:
word = interval.label
if word:
words.append(word)
for interval in alignment.tierDict['phones'].entryList:
phone = interval.label
phones.append(phone)
ends.append(interval.end)
frame_pos = librosa.time_to_frames(ends, sr=fs, hop_length=n_shift)
durations = np.diff(frame_pos, prepend=0)
assert len(durations) == len(phones)
# merge '' and sp in the end
if phones[-1] == '' and len(phones) > 1 and phones[-2] == 'sp':
phones = phones[:-1]
durations[-2] += durations[-1]
durations = durations[:-1]
    # replace '' and 'sil' with 'sp'
phones = ['sp' if (phn == '' or phn == 'sil') else phn for phn in phones]
if lang == 'en':
DICT = DICT_EN
elif lang == 'zh':
DICT = DICT_ZH
word2phns_dict = get_dict(DICT)
phn2word_dict = []
for word in words:
if lang == 'en':
word = word.upper()
phn2word_dict.append([word2phns_dict[word].split(), word])
non_sp_idx = 0
word_idx = 0
i = 0
word2phns = {}
while i < len(phones):
phn = phones[i]
if phn == 'sp':
word2phns[str(word_idx) + '_sp'] = ['sp']
i += 1
else:
phns, word = phn2word_dict[non_sp_idx]
word2phns[str(word_idx) + '_' + word] = phns
non_sp_idx += 1
i += len(phns)
word_idx += 1
sum_phn = sum(len(word2phns[k]) for k in word2phns)
assert sum_phn == len(phones)
results = ''
for (p, d) in zip(phones, durations):
results += p + ' ' + str(d) + ' '
return results.strip(), word2phns
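`_readtg` converts TextGrid end times into per-phone frame durations with `librosa.time_to_frames` followed by `np.diff(..., prepend=0)`. A quick worked example at the defaults `fs=24000`, `n_shift=300` (12.5 ms frames):
```python
import librosa
import numpy as np

ends = [0.10, 0.25, 0.40]  # phone end times in seconds, as read from the tier
frame_pos = librosa.time_to_frames(ends, sr=24000, hop_length=300)
print(frame_pos)                      # [ 8 20 32]: cumulative frame positions
print(np.diff(frame_pos, prepend=0))  # [ 8 12 12]: frames per phone
```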
def alignment(wav_path: str,
text: str,
fs: int=24000,
lang='en',
n_shift: int=300):
wav_name = os.path.basename(wav_path)
utt = wav_name.split('.')[0]
# prepare data for MFA
tmp_name = get_tmp_name(text=text)
tmpbase = './tmp_dir/' + tmp_name
tmpbase = Path(tmpbase)
tmpbase.mkdir(parents=True, exist_ok=True)
print("tmp_name in alignment:",tmp_name)
shutil.copyfile(wav_path, tmpbase / wav_name)
txt_name = utt + '.txt'
txt_path = tmpbase / txt_name
with open(txt_path, 'w') as wf:
wf.write(text + '\n')
# MFA
if lang == 'en':
DICT = DICT_EN
MODEL_DIR = MODEL_DIR_EN
elif lang == 'zh':
DICT = DICT_ZH
MODEL_DIR = MODEL_DIR_ZH
else:
        raise ValueError(f'lang must be "en" or "zh", but got {lang}!')
CMD = 'mfa_align' + ' ' + str(
tmpbase) + ' ' + DICT + ' ' + MODEL_DIR + ' ' + str(tmpbase)
os.system(CMD)
tg_path = str(tmpbase) + '/' + tmp_name + '/' + utt + '.TextGrid'
phn_dur, word2phns = _readtg(tg_path, lang=lang)
phn_dur = phn_dur.split()
phns = phn_dur[::2]
durs = phn_dur[1::2]
durs = [int(d) for d in durs]
assert len(phns) == len(durs)
return phns, durs, word2phns
def words2phns(text: str, lang='en'):
'''
Args:
text (str):
input text.
eg: for that reason cover is impossible to be given.
lang (str):
'en' or 'zh'
Returns:
List[str]: phones of input text.
eg:
['F', 'AO1', 'R', 'DH', 'AE1', 'T', 'R', 'IY1', 'Z', 'AH0', 'N', 'K', 'AH1', 'V', 'ER0',
'IH1', 'Z', 'IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L', 'T', 'UW1', 'B', 'IY1',
'G', 'IH1', 'V', 'AH0', 'N']
Dict(str, str): key - idx_word
value - phones
eg:
{'0_FOR': ['F', 'AO1', 'R'], '1_THAT': ['DH', 'AE1', 'T'],
'2_REASON': ['R', 'IY1', 'Z', 'AH0', 'N'],'3_COVER': ['K', 'AH1', 'V', 'ER0'], '4_IS': ['IH1', 'Z'],
'5_IMPOSSIBLE': ['IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L'],
'6_TO': ['T', 'UW1'], '7_BE': ['B', 'IY1'], '8_GIVEN': ['G', 'IH1', 'V', 'AH0', 'N']}
'''
text = text.strip()
words = []
for pun in [
',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---', u',',
u'。', u':', u';', u'!', u'?', u'(', u')'
]:
text = text.replace(pun, ' ')
for wrd in text.split():
if (wrd[-1] == '-'):
wrd = wrd[:-1]
if (wrd[0] == "'"):
wrd = wrd[1:]
if wrd:
words.append(wrd)
    if lang == 'en':
        dictfile = DICT_EN
    elif lang == 'zh':
        dictfile = DICT_ZH
    else:
        raise ValueError(f"lang must be 'en' or 'zh', but got '{lang}'")
word2phns_dict = get_dict(dictfile)
ds = word2phns_dict.keys()
phns = []
wrd2phns = {}
for index, wrd in enumerate(words):
if lang == 'en':
wrd = wrd.upper()
        if wrd not in ds:
            # out-of-vocabulary word: fall back to the unknown-phone token
            wrd2phns[str(index) + '_' + wrd] = ['spn']
            phns.append('spn')
else:
wrd2phns[str(index) + '_' + wrd] = word2phns_dict[wrd].split()
phns.extend(word2phns_dict[wrd].split())
return phns, wrd2phns
def get_phns_spans(wav_path: str,
old_str: str='',
new_str: str='',
source_lang: str='en',
target_lang: str='en',
fs: int=24000,
n_shift: int=300):
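    # Align old_str against the recording with MFA, then work out which phone
    # span of the old utterance has to be replaced (span_to_repl) and which
    # phone span of the new utterance replaces it (span_to_add).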
is_append = (old_str == new_str[:len(old_str)])
old_phns, mfa_start, mfa_end = [], [], []
# source
lang = source_lang
phn, dur, w2p = alignment(
wav_path=wav_path, text=old_str, lang=lang, fs=fs, n_shift=n_shift)
new_d_cumsum = np.pad(np.array(dur).cumsum(0), (1, 0), 'constant').tolist()
mfa_start = new_d_cumsum[:-1]
mfa_end = new_d_cumsum[1:]
old_phns = phn
# target
    cross_lingual_clone = is_append and (source_lang != target_lang)
if cross_lingual_clone:
str_origin = new_str[:len(old_str)]
str_append = new_str[len(old_str):]
if target_lang == 'zh':
phns_origin, origin_w2p = words2phns(str_origin, lang='en')
phns_append, append_w2p_tmp = words2phns(str_append, lang='zh')
elif target_lang == 'en':
            # original sentence
phns_origin, origin_w2p = words2phns(str_origin, lang='zh')
            # cloned (appended) sentence
phns_append, append_w2p_tmp = words2phns(str_append, lang='en')
        else:
            assert target_lang in ('zh', 'en'), \
                'cloning is not supported for this language, please check it.'
new_phns = phns_origin + phns_append
append_w2p = {}
length = len(origin_w2p)
for key, value in append_w2p_tmp.items():
idx, wrd = key.split('_')
append_w2p[str(int(idx) + length) + '_' + wrd] = value
new_w2p = origin_w2p.copy()
new_w2p.update(append_w2p)
else:
if source_lang == target_lang:
new_phns, new_w2p = words2phns(new_str, lang=source_lang)
        else:
            assert source_lang == target_lang, \
                'source language is not the same as the target language...'
span_to_repl = [0, len(old_phns) - 1]
span_to_add = [0, len(new_phns) - 1]
left_idx = 0
new_phns_left = []
sp_count = 0
    # find the first index (from the left) where old and new phones diverge.
    # note: words2phns run through MFA alignment may contain extra 'sp'
    # tokens compared with words2phns applied to the text directly.
for key in w2p.keys():
idx, wrd = key.split('_')
if wrd == 'sp':
sp_count += 1
new_phns_left.append('sp')
else:
idx = str(int(idx) - sp_count)
if idx + '_' + wrd in new_w2p:
                # index into the phone sequence of new_str
left_idx += len(new_w2p[idx + '_' + wrd])
                # phones of the old sequence
new_phns_left.extend(w2p[key])
else:
span_to_repl[0] = len(new_phns_left)
span_to_add[0] = len(new_phns_left)
break
    # scan w2p and new_w2p from the right to find where they diverge
right_idx = 0
new_phns_right = []
sp_count = 0
w2p_max_idx = _get_max_idx(w2p)
new_w2p_max_idx = _get_max_idx(new_w2p)
new_phns_mid = []
if is_append:
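        # append mode: everything after the shared left prefix is new, so the
        # replaced span runs to the end of the old phone sequence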
new_phns_right = []
new_phns_mid = new_phns[left_idx:]
span_to_repl[0] = len(new_phns_left)
span_to_add[0] = len(new_phns_left)
span_to_add[1] = len(new_phns_left) + len(new_phns_mid)
span_to_repl[1] = len(old_phns) - len(new_phns_right)
    # speech editing: a span in the middle of the utterance is replaced
else:
for key in list(w2p.keys())[::-1]:
idx, wrd = key.split('_')
if wrd == 'sp':
sp_count += 1
new_phns_right = ['sp'] + new_phns_right
else:
idx = str(new_w2p_max_idx - (w2p_max_idx - int(idx) - sp_count))
if idx + '_' + wrd in new_w2p:
right_idx -= len(new_w2p[idx + '_' + wrd])
new_phns_right = w2p[key] + new_phns_right
else:
span_to_repl[1] = len(old_phns) - len(new_phns_right)
new_phns_mid = new_phns[left_idx:right_idx]
span_to_add[1] = len(new_phns_left) + len(new_phns_mid)
if len(new_phns_mid) == 0:
span_to_add[1] = min(span_to_add[1] + 1, len(new_phns))
span_to_add[0] = max(0, span_to_add[0] - 1)
span_to_repl[0] = max(0, span_to_repl[0] - 1)
span_to_repl[1] = min(span_to_repl[1] + 1,
len(old_phns))
break
new_phns = new_phns_left + new_phns_mid + new_phns_right
'''
For that reason cover should not be given.
For that reason cover is impossible to be given.
span_to_repl: [17, 23] "should not"
span_to_add: [17, 30] "is impossible to"
'''
outs = {}
outs['mfa_start'] = mfa_start
outs['mfa_end'] = mfa_end
outs['old_phns'] = old_phns
outs['new_phns'] = new_phns
outs['span_to_repl'] = span_to_repl
outs['span_to_add'] = span_to_add
return outs
if __name__ == '__main__':
text = "For that reason cover should not be given."
phn, dur, word2phns = alignment("exp/p243_313.wav", text, lang='en')
print(phn, dur)
print(word2phns)
print("---------------------------------")
    # here we can use our Chinese frontend to get the pinyin sequence
text_zh = "卡尔普陪外孙玩滑梯。"
text_zh = pypinyin.lazy_pinyin(
text_zh,
neutral_tone_with_five=True,
style=pypinyin.Style.TONE3,
tone_sandhi=True)
text_zh = " ".join(text_zh)
phn, dur, word2phns = alignment("exp/000001.wav", text_zh, lang='zh')
print(phn, dur)
print(word2phns)
print("---------------------------------")
phns, wrd2phns = words2phns(text, lang='en')
print("phns:", phns)
print("wrd2phns:", wrd2phns)
print("---------------------------------")
phns, wrd2phns = words2phns(text_zh, lang='zh')
print("phns:", phns)
print("wrd2phns:", wrd2phns)
print("---------------------------------")
outs = get_phns_spans(
wav_path="exp/p243_313.wav",
old_str="For that reason cover should not be given.",
new_str="for that reason cover is impossible to be given.")
mfa_start = outs["mfa_start"]
mfa_end = outs["mfa_end"]
old_phns = outs["old_phns"]
new_phns = outs["new_phns"]
span_to_repl = outs["span_to_repl"]
span_to_add = outs["span_to_add"]
print("mfa_start:", mfa_start)
print("mfa_end:", mfa_end)
print("old_phns:", old_phns)
print("new_phns:", new_phns)
print("span_to_repl:", span_to_repl)
print("span_to_add:", span_to_add)
print("---------------------------------")
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Normalize feature files and dump them."""
import argparse
import logging
from operator import itemgetter
from pathlib import Path
import jsonlines
import numpy as np
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
from paddlespeech.t2s.datasets.data_table import DataTable
def main():
"""Run preprocessing process."""
parser = argparse.ArgumentParser(
description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)."
)
    parser.add_argument(
        "--metadata",
        type=str,
        required=True,
        help="path of the metadata file (jsonlines) that lists the feature "
        "files to be normalized.")
parser.add_argument(
"--dumpdir",
type=str,
required=True,
help="directory to dump normalized feature files.")
parser.add_argument(
"--speech-stats",
type=str,
required=True,
help="speech statistics file.")
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--speaker-dict", type=str, default=None, help="speaker id map file.")
    args = parser.parse_args()
    # configure logging so that the logging.info messages below are emitted
    logging.basicConfig(level=logging.INFO)
dumpdir = Path(args.dumpdir).expanduser()
# use absolute path
dumpdir = dumpdir.resolve()
dumpdir.mkdir(parents=True, exist_ok=True)
# get dataset
with jsonlines.open(args.metadata, 'r') as reader:
metadata = list(reader)
dataset = DataTable(
metadata, converters={
"speech": np.load,
})
logging.info(f"The number of files = {len(dataset)}.")
# restore scaler
speech_scaler = StandardScaler()
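    # the stats file is expected to store the mean in row 0 and the scale in row 1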
speech_scaler.mean_ = np.load(args.speech_stats)[0]
speech_scaler.scale_ = np.load(args.speech_stats)[1]
speech_scaler.n_features_in_ = speech_scaler.mean_.shape[0]
vocab_phones = {}
with open(args.phones_dict, 'rt') as f:
phn_id = [line.strip().split() for line in f.readlines()]
for phn, id in phn_id:
vocab_phones[phn] = int(id)
vocab_speaker = {}
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
for spk, id in spk_id:
vocab_speaker[spk] = int(id)
# process each file
output_metadata = []
for item in tqdm(dataset):
utt_id = item['utt_id']
speech = item['speech']
# normalize
speech = speech_scaler.transform(speech)
speech_dir = dumpdir / "data_speech"
speech_dir.mkdir(parents=True, exist_ok=True)
speech_path = speech_dir / f"{utt_id}_speech.npy"
np.save(speech_path, speech.astype(np.float32), allow_pickle=False)
phone_ids = [vocab_phones[p] for p in item['phones']]
spk_id = vocab_speaker[item["speaker"]]
record = {
"utt_id": item['utt_id'],
"spk_id": spk_id,
"text": phone_ids,
"text_lengths": item['text_lengths'],
"speech_lengths": item['speech_lengths'],
"durations": item['durations'],
"speech": str(speech_path),
"align_start": item['align_start'],
"align_end": item['align_end'],
}
# add spk_emb for voice cloning
if "spk_emb" in item:
record["spk_emb"] = str(item["spk_emb"])
output_metadata.append(record)
output_metadata.sort(key=itemgetter('utt_id'))
    output_metadata_path = dumpdir / "metadata.jsonl"
with jsonlines.open(output_metadata_path, 'w') as writer:
for item in output_metadata:
writer.write(item)
logging.info(f"metadata dumped into {output_metadata_path}")
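# A hypothetical invocation (paths are illustrative, not taken from this repo):
# python3 normalize.py \
#     --metadata=dump/test/raw/metadata.jsonl \
#     --dumpdir=dump/test/norm \
#     --speech-stats=dump/train/speech_stats.npy \
#     --phones-dict=dump/phone_id_map.txt \
#     --speaker-dict=dump/speaker_id_map.txt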
if __name__ == "__main__":
main()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn
from paddlespeech.t2s.exps.syn_utils import denorm
from paddlespeech.t2s.exps.syn_utils import get_am_inference
from paddlespeech.t2s.exps.syn_utils import get_test_dataset
from paddlespeech.t2s.exps.syn_utils import get_voc_inference
def evaluate(args):
    # the DataLoader logger is too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
with jsonlines.open(args.test_metadata, 'r') as reader:
test_metadata = list(reader)
# Init body.
with open(args.erniesat_config) as f:
erniesat_config = CfgNode(yaml.safe_load(f))
with open(args.voc_config) as f:
voc_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(erniesat_config)
print(voc_config)
# ernie sat model
erniesat_inference = get_am_inference(
am='erniesat_dataset',
am_config=erniesat_config,
am_ckpt=args.erniesat_ckpt,
am_stat=args.erniesat_stat,
phones_dict=args.phones_dict)
test_dataset = get_test_dataset(
test_metadata=test_metadata, am='erniesat_dataset')
# vocoder
voc_inference = get_voc_inference(
voc=args.voc,
voc_config=voc_config,
voc_ckpt=args.voc_ckpt,
voc_stat=args.voc_stat)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
collate_fn = build_erniesat_collate_fn(
mlm_prob=erniesat_config.mlm_prob,
mean_phn_span=erniesat_config.mean_phn_span,
seg_emb=erniesat_config.model['enc_input_layer'] == 'sega_mlm',
text_masking=False)
gen_raw = True
erniesat_mu, erniesat_std = np.load(args.erniesat_stat)
for datum in test_dataset:
# collate function and dataloader
utt_id = datum["utt_id"]
speech_len = datum["speech_lengths"]
# mask the middle 1/3 speech
left_bdy, right_bdy = speech_len // 3, 2 * speech_len // 3
span_bdy = [left_bdy, right_bdy]
datum.update({"span_bdy": span_bdy})
batch = collate_fn([datum])
with paddle.no_grad():
out_mels = erniesat_inference(
speech=batch["speech"],
text=batch["text"],
masked_pos=batch["masked_pos"],
speech_mask=batch["speech_mask"],
text_mask=batch["text_mask"],
speech_seg_pos=batch["speech_seg_pos"],
text_seg_pos=batch["text_seg_pos"],
span_bdy=span_bdy)
# vocoder
wav_list = []
for mel in out_mels:
part_wav = voc_inference(mel)
wav_list.append(part_wav)
wav = paddle.concat(wav_list)
wav = wav.numpy()
if gen_raw:
speech = datum['speech']
denorm_mel = denorm(speech, erniesat_mu, erniesat_std)
denorm_mel = paddle.to_tensor(denorm_mel)
wav_raw = voc_inference(denorm_mel)
wav_raw = wav_raw.numpy()
sf.write(
str(output_dir / (utt_id + ".wav")),
wav,
samplerate=erniesat_config.fs)
if gen_raw:
sf.write(
str(output_dir / (utt_id + "_raw" + ".wav")),
wav_raw,
samplerate=erniesat_config.fs)
print(f"{utt_id} done!")
def parse_args():
# parse args and config
parser = argparse.ArgumentParser(
description="Synthesize with acoustic model & vocoder")
# ernie sat
parser.add_argument(
'--erniesat_config',
type=str,
default=None,
help='Config of acoustic model.')
parser.add_argument(
'--erniesat_ckpt',
type=str,
default=None,
help='Checkpoint file of acoustic model.')
parser.add_argument(
"--erniesat_stat",
type=str,
default=None,
help="mean and standard deviation used to normalize spectrogram when training acoustic model."
)
parser.add_argument(
"--phones_dict", type=str, default=None, help="phone vocabulary file.")
# vocoder
parser.add_argument(
'--voc',
type=str,
default='pwgan_csmsc',
choices=[
'pwgan_aishell3',
'pwgan_vctk',
'hifigan_aishell3',
'hifigan_vctk',
],
help='Choose vocoder type of tts task.')
parser.add_argument(
'--voc_config', type=str, default=None, help='Config of voc.')
parser.add_argument(
'--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
parser.add_argument(
"--voc_stat",
type=str,
default=None,
help="mean and standard deviation used to normalize spectrogram when training voc."
)
# other
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--test_metadata", type=str, help="test metadata.")
parser.add_argument("--output_dir", type=str, help="output dir.")
args = parser.parse_args()
return args
def main():
args = parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
    else:
        raise ValueError("ngpu should be >= 0 !")
evaluate(args)
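# A hypothetical invocation (file names are illustrative, not taken from this repo):
# python3 synthesize.py \
#     --erniesat_config=conf/default.yaml \
#     --erniesat_ckpt=exp/default/checkpoints/snapshot_iter_289500.pdz \
#     --erniesat_stat=dump/train/speech_stats.npy \
#     --phones_dict=dump/phone_id_map.txt \
#     --voc=hifigan_aishell3 \
#     --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \
#     --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
#     --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
#     --test_metadata=dump/test/norm/metadata.jsonl \
#     --output_dir=exp/default/test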
if __name__ == "__main__":
main()
001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧!
002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN.
003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。
004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。
005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。
006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!
007 我喜欢 eat apple, 你喜欢 drink milk。
008 我们要去云南 team building, 非常非常 happy.
\ No newline at end of file
...@@ -69,6 +69,10 @@ model_alias = {
     "paddlespeech.t2s.models.wavernn:WaveRNN",
     "wavernn_inference":
     "paddlespeech.t2s.models.wavernn:WaveRNNInference",
+    "erniesat":
+    "paddlespeech.t2s.models.ernie_sat:ErnieSAT",
+    "erniesat_inference":
+    "paddlespeech.t2s.models.ernie_sat:ErnieSATInference",
 }
...@@ -112,6 +116,7 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
     # model: {model_name}_{dataset}
     am_name = am[:am.rindex('_')]
     am_dataset = am[am.rindex('_') + 1:]
+    converters = {}
     if am_name == 'fastspeech2':
         fields = ["utt_id", "text"]
         if am_dataset in {"aishell3", "vctk",
...@@ -130,8 +135,17 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
         if voice_cloning:
             print("voice cloning!")
             fields += ["spk_emb"]
+    elif am_name == 'erniesat':
+        fields = [
+            "utt_id", "text", "text_lengths", "speech", "speech_lengths",
+            "align_start", "align_end"
+        ]
+        converters = {"speech": np.load}
+    else:
+        print("wrong am, please input right am!!!")
-    test_dataset = DataTable(data=test_metadata, fields=fields)
+    test_dataset = DataTable(
+        data=test_metadata, fields=fields, converters=converters)
     return test_dataset
...@@ -201,6 +215,10 @@ def get_am_inference(am: str='fastspeech2_csmsc',
         **am_config["model"])
     elif am_name == 'tacotron2':
         am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
+    elif am_name == 'erniesat':
+        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
+    else:
+        print("wrong am, please input right am!!!")
     am.set_state_dict(paddle.load(am_ckpt)["main_params"])
     am.eval()
......
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
...@@ -11,4 +11,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from .ernie_sat import *
+from .ernie_sat_updater import *
 from .mlm import *
...@@ -274,9 +274,7 @@ class FastSpeech2(nn.Layer):
         super().__init__()
         # store hyperparameters
-        self.idim = idim
         self.odim = odim
-        self.eos = idim - 1
         self.reduction_factor = reduction_factor
         self.encoder_type = encoder_type
         self.decoder_type = decoder_type
......
...@@ -418,7 +418,6 @@ def phones_masking(xs_pad: paddle.Tensor,
             mean_phn_span=mean_phn_span).nonzero()
         masked_start = align_start[idx][masked_phn_idxs].tolist()
         masked_end = align_end[idx][masked_phn_idxs].tolist()
-
         for s, e in zip(masked_start, masked_end):
             masked_pos[idx, s:e] = 1
     non_eos_mask = paddle.reshape(src_mask, paddle.shape(xs_pad)[:2])
...@@ -500,14 +499,15 @@ def phones_text_masking(xs_pad: paddle.Tensor,
                 set(range(length)) - set(masked_phn_idxs[0].tolist()))
             np.random.shuffle(unmasked_phn_idxs)
             masked_text_idxs = unmasked_phn_idxs[:text_mask_num_lower]
-            text_masked_pos[idx][masked_text_idxs] = 1
+            text_masked_pos[idx, masked_text_idxs] = 1
             masked_start = align_start[idx][masked_phn_idxs].tolist()
             masked_end = align_end[idx][masked_phn_idxs].tolist()
             for s, e in zip(masked_start, masked_end):
                 masked_pos[idx, s:e] = 1
-    non_eos_mask = paddle.reshape(src_mask, paddle.shape(xs_pad)[:2])
+    non_eos_mask = paddle.reshape(src_mask, shape=paddle.shape(xs_pad)[:2])
     masked_pos = masked_pos * non_eos_mask
-    non_eos_text_mask = paddle.reshape(text_mask, paddle.shape(xs_pad)[:2])
+    non_eos_text_mask = paddle.reshape(
+        text_mask, shape=paddle.shape(text_pad)[:2])
     text_masked_pos = text_masked_pos * non_eos_text_mask
     masked_pos = paddle.cast(masked_pos, 'bool')
     text_masked_pos = paddle.cast(text_masked_pos, 'bool')
......