{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "# 1. PP-TTS 模型介绍\n", "\n", "## 1.1 简介\n", "\n", "PP-TTS 是 PaddleSpeech 自研的流式语音合成系统。在实现[前沿算法](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md#text-to-speech-models)的基础上,使用了更快的推理引擎,实现了流式语音合成技术,使其满足商业语音交互场景的需求。\n", "\n", "#### PP-TTS\n", "语音合成基本流程如下图所示:\n", "
\n", "\n", "PP-TTS 默认提供基于 FastSpeech2 声学模型和 HiFiGAN 声码器的中文流式语音合成系统:\n", "\n", "- 文本前端:采用基于规则的中文文本前端系统,对文本正则、多音字、变调等中文文本场景进行了优化。\n", "- 声学模型:对 FastSpeech2 模型的 Decoder 进行改进,使其可以流式合成\n", "- 声码器:支持对 GAN Vocoder 的流式合成\n", "- 推理引擎:使用 ONNXRuntime 推理引擎优化模型推理性能,使得语音合成系统在低压 CPU 上也能达到 RTF<1,满足流式合成的要求\n", "\n", "\n", "## 1.2 特性\n", "- 开源领先的中文语音合成系统\n", "- 使用 ONNXRuntime 推理引擎优化模型推理性能\n", "- 唯一开源的流式语音合成系统\n", "- 易拆卸性:可以很方便地更换不同语种上的不同声学模型和声码器、使用不同的推理引擎(Paddle 动态图、PaddleInference 和 ONNXRuntime 等)、使用不同的网络服务(HTTP、Websocket)\n", "\n", "\n", "# 2. 模型效果及应用场景\n", "## 2.1 语音合成任务\n", "## 2.1.1 数据集:\n", "常见语音合成数据集如下表所示:\n", "\n", "| 语言 | 数据集 |音频信息 | 描述 |\n", "| -------- | -------- | -------- | -------- |\n", "| 中文 | [CSMSC](https://www.data-baker.com/open_source.html) | 48KHz, 16bit | 单说话人,女声,约12小时,具有高音频质量 |\n", "| 中文 | [AISHELL-3](http://www.aishelltech.com/aishell_3) | 44.1kHz,16bit | 多说话人(218人),约85小时,音频质量不一致(有的说话人音频质量较高)|\n", "| 英文 | [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/) | 22050Hz, 16bit | 单说话人,女声,约24小时,具有高音频质量|\n", "| 英文 | [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) | 48kHz, 16bit | 多说话人(110人), 约44小时,音频质量不一致(有的说话人音频质量较高)|\n", "\n", "## 2.1.2 模型效果速览\n", "点击 [链接](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) 听合成的音频。\n", "# 3. 模型如何使用\n", "## 3.1 模型推理\n", "### 安装 paddlespeech" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!pip install paddlespeech" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# paddlespeech 依赖需要用到 nltk 包,但是有时会因为网络原因导致不好下载,此处手动下载一下放到百度服务器的包\n", "!wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz\n", "!tar zxvf nltk_data.tar.gz" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from paddlespeech.cli.tts import TTSExecutor\n", "tts_executor = TTSExecutor()\n", "wav_file = tts_executor(\n", " text=\"热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!\",\n", " output='output.wav',\n", " am='fastspeech2_mix',\n", " voc='hifigan_csmsc',\n", " lang='mix',\n", " spk_id=174)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import IPython.display as dp\n", "dp.Audio('output.wav')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## 3.2 模型训练\n", "- [基于 CSMCS 数据集训练 FastSpeech2 模型](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)\n", "- [基于 CSMCS 数据集训练 HiFiGAN 模型](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)\n", "# 4. 模型原理\n", "\n", "\n", "### 4.1 声学模型 FastSpeech2\n", "我们使用 `FastSpeech2` 作为声学模型。\n", "
\n", "
FastSpeech2 网络结构图

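 "\n",
 "FastSpeech2 is non-autoregressive: a length regulator expands each phone's encoder output by its predicted duration, and the decoder then generates all mel frames in parallel. A minimal sketch of that expansion, with illustrative shapes (not the PaddleSpeech implementation):\n",
 "\n",
 "```python\n",
 "import numpy as np\n",
 "\n",
 "def length_regulate(phone_hiddens, durations):\n",
 "    # phone_hiddens: (num_phones, hidden_dim); durations: (num_phones,) frame counts.\n",
 "    # Repeat each phone's hidden vector `duration` times -> (num_frames, hidden_dim).\n",
 "    return np.repeat(phone_hiddens, durations, axis=0)\n",
 "\n",
 "# Example: 3 phones with durations 2, 3, 1 expand to 6 decoder frames.\n",
 "frames = length_regulate(np.random.randn(3, 256), np.array([2, 3, 1]))\n",
 "assert frames.shape == (6, 256)\n",
 "```\n",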
\n", "\n", "\n", "PaddleSpeech TTS 实现的 FastSpeech2 与论文不同的地方在于,我们使用的的是 phone 级别的 `pitch` 和 `energy`(与 FastPitch 类似),这样的合成结果可以更加**稳定**。\n", "
\n", "
FastPitch 网络结构图

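 "\n",
 "The original FastSpeech2 predicts pitch and energy per mel frame; predicting a single value per phone instead averages out frame-level jitter, which is where the extra stability comes from. Below is a sketch of how phone-level pitch and energy can be folded into the encoder output before length regulation; names, shapes, and bucket counts are illustrative, not the actual implementation:\n",
 "\n",
 "```python\n",
 "import numpy as np\n",
 "\n",
 "# Illustrative sketch of a FastPitch-style phone-level variance adaptor.\n",
 "def add_variance_embeddings(phone_hiddens, pitch, energy,\n",
 "                            pitch_table, energy_table, pitch_bins, energy_bins):\n",
 "    # phone_hiddens: (num_phones, hidden_dim); pitch/energy: one scalar per phone.\n",
 "    # Quantize each scalar into a bucket and add the bucket's learned embedding;\n",
 "    # embedding tables have len(bins) + 1 rows so every index is valid.\n",
 "    pitch_ids = np.digitize(pitch, pitch_bins)\n",
 "    energy_ids = np.digitize(energy, energy_bins)\n",
 "    return phone_hiddens + pitch_table[pitch_ids] + energy_table[energy_ids]\n",
 "```\n",
 "\n",
 "The enriched phone sequence is then expanded by the length regulator (see the sketch in 4.1 above) and decoded into a mel spectrogram.\n",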
\n", "\n", "### 4.2 声码器 HiFiGAN\n", "我们使用 `HiFiGAN` 作为声码器。\n", "\n", "1. 引入了多周期判别器(Multi-Period Discriminator,MPD)。HiFiGAN 同时拥有多尺度判别器(Multi-Scale Discriminator,MSD)和多周期判别器,目标就是尽可能增强 GAN 判别器甄别合成或真实音频的能力。\n", "2. 生成器中提出了多感受野融合模块。WaveNet为了增大感受野,叠加带洞卷积,逐样本点生成,音质确实很好,但是也使得模型较大,推理速度较慢。HiFiGAN 则提出了一种残差结构,交替使用带洞卷积和普通卷积增大感受野,保证合成音质的同时,提高推理速度。\n", "\n", "\"hifigan\"\n", "\n", "\n", "# 5. 注意事项\n", "# 6. 相关论文以及引用信息\n", "```text\n", "@article{ren2020fastspeech,\n", " title={Fastspeech 2: Fast and high-quality end-to-end text to speech},\n", " author={Ren, Yi and Hu, Chenxu and Tan, Xu and Qin, Tao and Zhao, Sheng and Zhao, Zhou and Liu, Tie-Yan},\n", " journal={arXiv preprint arXiv:2006.04558},\n", " year={2020}\n", "}\n", "\n", "@article{kong2020hifi,\n", " title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},\n", " author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},\n", " journal={Advances in Neural Information Processing Systems},\n", " volume={33},\n", " pages={17022--17033},\n", " year={2020}\n", "}\n", "```" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "py35-paddle1.2.0" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 1 }