Unverified commit 499de374, authored by 小湉湉, committed by GitHub

add PP-TTS (#5557)

Parent commit: 88982eae
import os

import gradio as gr
from paddlespeech.cli.tts import TTSExecutor

tts_executor = TTSExecutor()


def speech_generate(text: str) -> os.PathLike:
    assert isinstance(text,
                      str) and len(text) > 0, 'Input Chinese-English text...'
    wav_file = tts_executor(
        text=text,
        output='output.wav',
        am='fastspeech2_mix',
        voc='hifigan_csmsc',
        lang='mix',
        spk_id=174)
    return wav_file


def clear_all():
    return None, None


with gr.Blocks() as demo:
    gr.Markdown("Text to Speech")

    with gr.Column(scale=1, min_width=50):
        text_input = gr.Textbox(placeholder="Type here...", lines=5)
        with gr.Row():
            btn1 = gr.Button("Clear")
            btn2 = gr.Button("Submit")
        audio_output = gr.Audio(type="file", label="Output")

    btn2.click(
        fn=speech_generate,
        inputs=text_input,
        outputs=audio_output,
        scroll_to_output=True)
    btn1.click(fn=clear_all, inputs=None, outputs=[text_input, audio_output])

    gr.Button.style(1)

demo.launch(share=True)
【PP-TTS-App-YAML】
APP_Info:
  title: PP-TTS-App
  colorFrom: blue
  colorTo: yellow
  sdk: gradio
  sdk_version: 3.4.1
  app_file: app.py
  license: apache-2.0
  device: cpu
paddlepaddle==2.3.2
paddleaudio==1.0.1
paddlespeech==1.2.0
## 1. 训练 Benchmark
### 1.1 软硬件环境
* FastSpeech2 模型训练过程中使用 2 GPUs,每 GPU batch size为 64 进行训练。
* HiFiGAN 模型训练过程中使用 1 GPU,每 GPU batch size为 16 进行训练。
* python 版本: 3.7.0
* paddle 版本: v2.4.0rc0
* 机器: 8x Tesla V100-SXM2-32GB, 24 core Intel(R) Xeon(R) Gold 6148, 100Gbps RDMA network
### 1.2 数据集
| 语言 | 数据集 |音频信息 | 描述 |
| -------- | -------- | -------- | -------- |
| 中文 | [CSMSC](https://www.data-baker.com/open_source.html) | 48KHz, 16bit | 单说话人,女声,约12小时,具有高音频质量 |
| 中文 | [AISHELL-3](http://www.aishelltech.com/aishell_3) | 44.1kHz,16bit | 多说话人(218人),约85小时,音频质量不一致(有的说话人音频质量较高)|
| 英文 | [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/) | 22050Hz, 16bit | 单说话人,女声,约24小时,具有高音频质量|
| 英文 | [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) | 48kHz, 16bit | 多说话人(110人),约44小时,音频质量不一致(有的说话人音频质量较高)|
### 1.3 指标
|模型名称 | 模型简介 | 模型体积 | ips |
|---|---|---|---|
|fastspeech2_mix |语音合成声学模型|388MB|135 sequences/sec|
|hifigan_csmsc|语音合成声码器|873MB|30 sequences/sec|
## 2. 推理 Benchmark
参考 [TTS-Benchmark](https://github.com/PaddlePaddle/PaddleSpeech/wiki/TTS-Benchmark)
## 3. 相关使用说明
## 1. Training Benchmark
### 1.1 Environment
* FastSpeech2: 2 GPUs, batch size = 64 per GPU.
* HiFiGAN: 1 GPU, batch size = 16 per GPU.
* Python version: 3.7.0
* Paddle version: v2.4.0rc0
* machine: 8x Tesla V100-SXM2-32GB, 24 core Intel(R) Xeon(R) Gold 6148, 100Gbps RDMA network
### 1.2 Datasets
| Language | Dataset | Audio info | Description |
| -------- | -------- | -------- | -------- |
| Chinese | [CSMSC](https://www.data-baker.com/open_source.html) | 48 kHz, 16-bit | single speaker, female, 12 h |
| Chinese | [AISHELL-3](http://www.aishelltech.com/aishell_3) | 44.1 kHz, 16-bit | multi-speaker, 85 h |
| English | [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/) | 22050 Hz, 16-bit | single speaker, female, 24 h |
| English | [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) | 48 kHz, 16-bit | multi-speaker, 44 h |
### 1.3 Benchmark
|model | task | model_size | ips |
|---|---|---|---|
|fastspeech2_mix |TTS Acoustic Model|388MB|135 sequences/sec|
|hifigan_csmsc|TTS Vocoder|873MB|30 sequences/sec|
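
As a rough way to read the throughput figures above, the sketch below converts `ips` into an approximate time per training step, using the GPU counts and batch sizes from section 1.1. It assumes that `ips` is the aggregate throughput across all GPUs used for a model, which the table does not state; `seconds_per_step` is a hypothetical helper, not part of PaddleSpeech.

```python
def seconds_per_step(num_gpus: int, batch_per_gpu: int, ips: float) -> float:
    """Approximate wall-clock seconds per training step.

    Assumption: `ips` (sequences/sec) is measured over all GPUs together,
    so one step consumes num_gpus * batch_per_gpu sequences.
    """
    if ips <= 0:
        raise ValueError("ips must be positive")
    return num_gpus * batch_per_gpu / ips


# Settings taken from sections 1.1 and 1.3 of this benchmark.
for name, gpus, batch, ips in [
    ("fastspeech2_mix", 2, 64, 135),
    ("hifigan_csmsc", 1, 16, 30),
]:
    print(f"{name}: {seconds_per_step(gpus, batch, ips):.2f} s/step")
```

If `ips` is instead reported per GPU, drop the `num_gpus` factor.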
## 2. Inference Benchmark
Please refer to [TTS-Benchmark](https://github.com/PaddlePaddle/PaddleSpeech/wiki/TTS-Benchmark).
## 3. Reference
# 下载
|模型名称 | 模型简介 | 模型体积 | 下载地址 |
|---|---|---|---|
|fastspeech2_mix |语音合成声学模型|388MB|[推理模型](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_0.2.0.zip)|
|hifigan_csmsc|语音合成声码器|873MB|[推理模型](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|
# Download
|model | task | model_size | download |
|---|---|---|---|
|fastspeech2_mix |TTS Acoustic Model|388MB|[inference_model](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_0.2.0.zip)|
|hifigan_csmsc|TTS Vocoder|873MB|[inference_model](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|
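
The archives in the table can also be fetched from a script. The following is a minimal sketch using only the Python standard library; the URLs are the ones listed above, while `archive_name`, `download_and_extract`, and the `pretrained_models` directory are hypothetical names chosen for illustration.

```python
import os
import urllib.request
import zipfile
from urllib.parse import urlparse

# URLs copied from the download table above.
MODEL_URLS = {
    "fastspeech2_mix": "https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_0.2.0.zip",
    "hifigan_csmsc": "https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip",
}


def archive_name(url: str) -> str:
    """Return the file name component of a download URL."""
    return os.path.basename(urlparse(url).path)


def download_and_extract(name: str, dest_dir: str = "pretrained_models") -> str:
    """Download the archive for `name` (if not already present) and unzip it.

    `dest_dir` is an arbitrary local directory picked for this example.
    """
    url = MODEL_URLS[name]
    os.makedirs(dest_dir, exist_ok=True)
    zip_path = os.path.join(dest_dir, archive_name(url))
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir
```

Calling `download_and_extract("fastspeech2_mix")` would place the unpacked files under `pretrained_models/`. The archives are several hundred megabytes each, so the existence check avoids downloading them again on repeated runs.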
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 1. PP-TTS 模型介绍\n",
"\n",
"## 1.1 简介\n",
"\n",
"PP-TTS 是 PaddleSpeech 自研的流式语音合成系统。在实现[前沿算法](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md#text-to-speech-models)的基础上,使用了更快的推理引擎,实现了流式语音合成技术,使其满足商业语音交互场景的需求。\n",
"\n",
"#### PP-TTS\n",
"语音合成基本流程如下图所示:\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/ea69ae1faff84940a59c7079d16b3a8db2741d2c423846f68822f4a7f28726e9 width=\"600\" ></center>\n",
"\n",
"PP-TTS 默认提供基于 FastSpeech2 声学模型和 HiFiGAN 声码器的中文流式语音合成系统:\n",
"\n",
"- 文本前端:采用基于规则的中文文本前端系统,对文本正则、多音字、变调等中文文本场景进行了优化。\n",
"- 声学模型:对 FastSpeech2 模型的 Decoder 进行改进,使其可以流式合成\n",
"- 声码器:支持对 GAN Vocoder 的流式合成\n",
"- 推理引擎:使用 ONNXRuntime 推理引擎优化模型推理性能,使得语音合成系统在低压 CPU 上也能达到 RTF<1,满足流式合成的要求\n",
"\n",
"\n",
"## 1.2 特性\n",
"- 开源领先的中文语音合成系统\n",
"- 使用 ONNXRuntime 推理引擎优化模型推理性能\n",
"- 唯一开源的流式语音合成系统\n",
"- 易拆卸性:可以很方便地更换不同语种上的不同声学模型和声码器、使用不同的推理引擎(Paddle 动态图、PaddleInference 和 ONNXRuntime 等)、使用不同的网络服务(HTTP、Websocket)\n",
"\n",
"\n",
"# 2. 模型效果及应用场景\n",
"## 2.1 语音合成任务\n",
"## 2.1.1 数据集:\n",
"常见语音合成数据集如下表所示:\n",
"\n",
"| 语言 | 数据集 |音频信息 | 描述 |\n",
"| -------- | -------- | -------- | -------- |\n",
"| 中文 | [CSMSC](https://www.data-baker.com/open_source.html) | 48KHz, 16bit | 单说话人,女声,约12小时,具有高音频质量 |\n",
"| 中文 | [AISHELL-3](http://www.aishelltech.com/aishell_3) | 44.1kHz,16bit | 多说话人(218人),约85小时,音频质量不一致(有的说话人音频质量较高)|\n",
"| 英文 | [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/) | 22050Hz, 16bit | 单说话人,女声,约24小时,具有高音频质量|\n",
"| 英文 | [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) | 48kHz, 16bit | 多说话人(110人), 约44小时,音频质量不一致(有的说话人音频质量较高)|\n",
"\n",
"## 2.1.2 模型效果速览\n",
"点击 [链接](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) 听合成的音频。\n",
"# 3. 模型如何使用\n",
"## 3.1 模型推理\n",
"### 安装 paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!pip install paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# paddlespeech 依赖需要用到 nltk 包,但是有时会因为网络原因导致不好下载,此处手动下载一下放到百度服务器的包\n",
"!wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz\n",
"!tar zxvf nltk_data.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from paddlespeech.cli.tts import TTSExecutor\n",
"tts_executor = TTSExecutor()\n",
"wav_file = tts_executor(\n",
" text=\"热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!\",\n",
" output='output.wav',\n",
" am='fastspeech2_mix',\n",
" voc='hifigan_csmsc',\n",
" lang='mix',\n",
" spk_id=174)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import IPython.display as dp\n",
"dp.Audio('output.wav')"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## 3.2 模型训练\n",
"- [基于 CSMSC 数据集训练 FastSpeech2 模型](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)\n",
"- [基于 CSMSC 数据集训练 HiFiGAN 模型](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)\n",
"# 4. 模型原理\n",
"\n",
"\n",
"### 4.1 声学模型 FastSpeech2\n",
"我们使用 `FastSpeech2` 作为声学模型。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6b6d671713ec4d20a0e60653c7a5d4ae3c35b1d1e58b4cc39e0bc82ad4a341d9\"></center>\n",
"<br><center> FastSpeech2 网络结构图</center></br>\n",
"\n",
"\n",
"PaddleSpeech TTS 实现的 FastSpeech2 与论文不同的地方在于,我们使用的是 phone 级别的 `pitch` 和 `energy`(与 FastPitch 类似),这样的合成结果可以更加**稳定**。\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/862c21456c784c41a83a308b7d9707f0810cc3b3c6f94ed48c60f5d32d0072f0\"></center>\n",
"<br><center> FastPitch 网络结构图</center></br>\n",
"\n",
"### 4.2 声码器 HiFiGAN\n",
"我们使用 `HiFiGAN` 作为声码器。\n",
"\n",
"1. 引入了多周期判别器(Multi-Period Discriminator,MPD)。HiFiGAN 同时拥有多尺度判别器(Multi-Scale Discriminator,MSD)和多周期判别器,目标就是尽可能增强 GAN 判别器甄别合成或真实音频的能力。\n",
"2. 生成器中提出了多感受野融合模块。WaveNet为了增大感受野,叠加带洞卷积,逐样本点生成,音质确实很好,但是也使得模型较大,推理速度较慢。HiFiGAN 则提出了一种残差结构,交替使用带洞卷积和普通卷积增大感受野,保证合成音质的同时,提高推理速度。\n",
"\n",
"<img width=\"1054\" alt=\"hifigan\" src=\"https://user-images.githubusercontent.com/24568452/200246150-bad56215-a1ce-4536-9230-bbadc0ce57b6.png\">\n",
"\n",
"\n",
"# 5. 注意事项\n",
"# 6. 相关论文以及引用信息\n",
"```text\n",
"@article{ren2020fastspeech,\n",
" title={Fastspeech 2: Fast and high-quality end-to-end text to speech},\n",
" author={Ren, Yi and Hu, Chenxu and Tan, Xu and Qin, Tao and Zhao, Sheng and Zhao, Zhou and Liu, Tie-Yan},\n",
" journal={arXiv preprint arXiv:2006.04558},\n",
" year={2020}\n",
"}\n",
"\n",
"@article{kong2020hifi,\n",
" title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},\n",
" author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},\n",
" journal={Advances in Neural Information Processing Systems},\n",
" volume={33},\n",
" pages={17022--17033},\n",
" year={2020}\n",
"}\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"# 1. PP-TTS Introduction\n",
"\n",
"## 1.1 Introduction\n",
"\n",
"PP-TTS is a streaming speech synthesis system developed by PaddleSpeech. Building on its implementations of [state-of-the-art algorithms](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md#text-to-speech-models), it uses a faster inference engine to provide streaming speech synthesis that meets the needs of commercial speech interaction scenarios.\n",
"\n",
"\n",
"#### PP-TTS\n",
"<center><img src=https://ai-studio-static-online.cdn.bcebos.com/ea69ae1faff84940a59c7079d16b3a8db2741d2c423846f68822f4a7f28726e9 width=\"600\" ></center>\n",
"\n",
"By default, PP-TTS provides a Chinese streaming speech synthesis system based on the FastSpeech2 acoustic model and the HiFiGAN vocoder:\n",
"\n",
"- Text Frontend: A rule-based Chinese text frontend, optimized for Chinese-specific cases such as text normalization, polyphonic characters, and tone sandhi.\n",
"- Acoustic Model: The FastSpeech2 decoder is modified so that it supports streaming synthesis.\n",
"- Vocoder: Streaming synthesis with GAN vocoders is supported.\n",
"- Inference Engine: ONNXRuntime is used to optimize model inference, so that the TTS system reaches a real-time factor (RTF) below 1 even on low-voltage CPUs, which meets the requirement for streaming synthesis.\n",
"\n",
"\n",
"## 1.2 Features\n",
"- A leading open-source Chinese TTS system\n",
"- Uses ONNXRuntime to optimize TTS model inference\n",
"- The only open-source streaming TTS system\n",
"- Modularity: developers can easily swap acoustic models and vocoders for different languages, switch inference engines (Paddle dynamic graph, PaddleInference, ONNXRuntime, etc.), and choose between network services (HTTP, WebSocket)\n",
"\n",
"\n",
"# 2. Model Effects and Application Scenarios\n",
"## 2.1 TTS\n",
"## 2.1.1 Datasets:\n",
"\n",
"Common TTS datasets are shown in the following table:\n",
"\n",
"| Language | Dataset | Audio info | Description |\n",
"| -------- | -------- | -------- | -------- |\n",
"| Chinese | [CSMSC](https://www.data-baker.com/open_source.html) | 48 kHz, 16-bit | single speaker, female, 12 h |\n",
"| Chinese | [AISHELL-3](http://www.aishelltech.com/aishell_3) | 44.1 kHz, 16-bit | multi-speaker, 85 h |\n",
"| English | [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/) | 22050 Hz, 16-bit | single speaker, female, 24 h |\n",
"| English | [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) | 48 kHz, 16-bit | multi-speaker, 44 h |\n",
"\n",
"## 2.1.2 Model Effects\n",
"Follow this [link](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) to listen to synthesized audio samples.\n",
"\n",
"# 3. How to Use the Model\n",
"## 3.1 Model Inference\n",
"### Install paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!pip install paddlespeech"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"!wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz\n",
"!tar zxvf nltk_data.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from paddlespeech.cli.tts import TTSExecutor\n",
"tts_executor = TTSExecutor()\n",
"wav_file = tts_executor(\n",
" text=\"热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!\",\n",
" output='output.wav',\n",
" am='fastspeech2_mix',\n",
" voc='hifigan_csmsc',\n",
" lang='mix',\n",
" spk_id=174)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import IPython.display as dp\n",
"dp.Audio('output.wav')"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"## 3.2 Model Training\n",
"- [Train FastSpeech2 on the CSMSC dataset](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)\n",
"- [Train HiFiGAN on the CSMSC dataset](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)\n",
"# 4. Model Principles\n",
"\n",
"\n",
"### 4.1 Acoustic Model FastSpeech2\n",
"We use `FastSpeech2` as the acoustic model.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/6b6d671713ec4d20a0e60653c7a5d4ae3c35b1d1e58b4cc39e0bc82ad4a341d9\"></center>\n",
"<br><center> FastSpeech2 Model Structure</center></br>\n",
"\n",
"\n",
"The FastSpeech2 implementation in PaddleSpeech TTS differs from the paper: we use phone-level `pitch` and `energy` (similar to FastPitch), which makes the synthesized audio more **stable**.\n",
"<center><img src=\"https://ai-studio-static-online.cdn.bcebos.com/862c21456c784c41a83a308b7d9707f0810cc3b3c6f94ed48c60f5d32d0072f0\"></center>\n",
"<br><center> FastPitch Model Structure</center></br>\n",
"\n",
"### 4.2 Vocoder HiFiGAN\n",
"We use `HiFiGAN` as the vocoder.\n",
"\n",
"1. It introduces the Multi-Period Discriminator (MPD). HiFiGAN has both a Multi-Scale Discriminator (MSD) and an MPD, with the goal of making the GAN discriminators as good as possible at distinguishing synthesized audio from real audio.\n",
"2. Its generator introduces a multi-receptive field fusion module. To enlarge the receptive field, WaveNet stacks dilated convolutions and generates audio sample by sample; the sound quality is very good, but the model is large and inference is slow. HiFiGAN instead proposes a residual structure that alternates dilated and ordinary convolutions to enlarge the receptive field, preserving synthesis quality while speeding up inference.\n",
"\n",
"<img width=\"1054\" alt=\"hifigan\" src=\"https://user-images.githubusercontent.com/24568452/200246150-bad56215-a1ce-4536-9230-bbadc0ce57b6.png\">\n",
"\n",
"<br><center> HiFiGAN Model Structure</center></br>\n",
"\n",
"\n",
"# 5. Notes\n",
"# 6. Related papers and citations\n",
"```text\n",
"@article{ren2020fastspeech,\n",
" title={Fastspeech 2: Fast and high-quality end-to-end text to speech},\n",
" author={Ren, Yi and Hu, Chenxu and Tan, Xu and Qin, Tao and Zhao, Sheng and Zhao, Zhou and Liu, Tie-Yan},\n",
" journal={arXiv preprint arXiv:2006.04558},\n",
" year={2020}\n",
"}\n",
"\n",
"@article{kong2020hifi,\n",
" title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},\n",
" author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},\n",
" journal={Advances in Neural Information Processing Systems},\n",
" volume={33},\n",
" pages={17022--17033},\n",
" year={2020}\n",
"}\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "py35-paddle1.2.0"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}