# 1. PP-TTS Introduction

## 1.1 Introduction

PP-TTS is a streaming speech synthesis system developed by PaddleSpeech. It builds on PaddleSpeech's implementations of [state-of-the-art algorithms](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/released_model.md#text-to-speech-models) and uses a faster inference engine to deliver streaming speech synthesis that meets the needs of commercial speech interaction scenarios.


#### PP-TTS
<center><img src=https://ai-studio-static-online.cdn.bcebos.com/ea69ae1faff84940a59c7079d16b3a8db2741d2c423846f68822f4a7f28726e9 width="600" ></center>

PP-TTS provides a Chinese streaming speech synthesis system based on FastSpeech2 and HiFiGAN by default:

- Text Frontend: A rule-based Chinese text frontend handles text normalization, polyphone disambiguation, and tone sandhi.
- Acoustic Model: The FastSpeech2 decoder is modified to support streaming synthesis.
- Vocoder: Streaming synthesis with GAN vocoders is supported.
- Inference Engine: ONNXRuntime is used to optimize TTS model inference, so the system can reach a real-time factor (RTF) below 1 even on low-voltage (low-power) CPUs, which meets the requirement for streaming synthesis.
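The streaming idea and the RTF target can be illustrated with a minimal sketch. It assumes that streaming works by processing the frame sequence in fixed-size blocks with a few frames of context on each side, and that RTF is synthesis time divided by the duration of the audio produced. The helper names `iter_chunks` and `real_time_factor`, and the block and padding sizes, are hypothetical and are not PaddleSpeech APIs.

```python
from typing import Iterator, Tuple


def iter_chunks(num_frames: int, block: int, pad: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (ctx_start, ctx_end, keep_start, keep_end) frame indices.

    A model would run on frames[ctx_start:ctx_end], which is the block plus up
    to `pad` frames of context on each side. Only frames[keep_start:keep_end]
    would be emitted, so consecutive chunks do not overlap in the output.
    """
    if block <= 0 or pad < 0:
        raise ValueError("block must be positive and pad must be non-negative")
    for keep_start in range(0, num_frames, block):
        keep_end = min(keep_start + block, num_frames)
        ctx_start = max(keep_start - pad, 0)
        ctx_end = min(keep_end + pad, num_frames)
        yield ctx_start, ctx_end, keep_start, keep_end


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative sizes only: 100 frames, blocks of 42 frames, 12 frames of context.
chunks = list(iter_chunks(num_frames=100, block=42, pad=12))
```

An RTF below 1 means each chunk of audio is produced faster than it takes to play, which is what lets playback start before the whole utterance has been synthesized.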


## 1.2 Features
- A leading open-source Chinese TTS system
- ONNXRuntime-optimized inference for TTS models
- The only open-source streaming TTS system
- Modular design: developers can easily swap in different acoustic models and vocoders for different languages, choose among inference engines (Paddle dynamic graph, PaddleInference, ONNXRuntime, etc.), and choose among network services (HTTP, WebSocket)
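The modularity described in the last item can be pictured with a small sketch. It assumes the pipeline is a chain of three interchangeable stages (frontend, acoustic model, vocoder). All names and signatures here are hypothetical stand-ins for illustration, not PaddleSpeech classes, and the toy stages produce placeholder numbers rather than real audio.

```python
from typing import Callable, List

# Hypothetical stage signatures (assumptions, not PaddleSpeech types).
Frontend = Callable[[str], List[int]]                      # text -> phone ids
AcousticModel = Callable[[List[int]], List[List[float]]]   # phone ids -> mel frames
Vocoder = Callable[[List[List[float]]], List[float]]       # mel frames -> samples


def build_tts(frontend: Frontend, acoustic_model: AcousticModel, vocoder: Vocoder):
    """Compose three stages into one text-to-samples function."""
    def synthesize(text: str) -> List[float]:
        return vocoder(acoustic_model(frontend(text)))
    return synthesize


# Toy stages used only to show that each stage can be replaced independently.
def toy_frontend(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_acoustic_model(phone_ids: List[int]) -> List[List[float]]:
    return [[float(pid)] for pid in phone_ids]


def toy_vocoder(mel: List[List[float]]) -> List[float]:
    # Emit two "samples" per frame.
    return [frame[0] for frame in mel for _ in range(2)]


tts = build_tts(toy_frontend, toy_acoustic_model, toy_vocoder)
samples = tts("ab")
```

Because each stage only depends on the shape of its input and output, a different vocoder or inference backend can be passed to `build_tts` without touching the other stages.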


# 2. Model Effects and Application Scenarios
## 2.1 TTS
## 2.1.1 Datasets

Common TTS datasets are shown in the following table:

| Language | Dataset | Audio info | Description |
| -------- | -------- | -------- | -------- |
| Chinese | [CSMSC](https://www.data-baker.com/open_source.html) | 48 kHz, 16-bit | single speaker, female, 12 h |
| Chinese | [AISHELL-3](http://www.aishelltech.com/aishell_3) | 44.1 kHz, 16-bit | multi-speaker, 85 h |
| English | [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/) | 22050 Hz, 16-bit | single speaker, female, 24 h |
| English | [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) | 48 kHz, 16-bit | multi-speaker, 44 h |

## 2.1.2 Model Effects
Click this [link](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) to listen to synthesized audio samples.

# 3. How to Use the Model
## 3.1 Model Inference
### Install paddlespeech

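    !pip install paddlespeech

Download and extract the NLTK data archive:

    !wget https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz
    !tar zxvf nltk_data.tar.gz

Synthesize mixed Chinese and English speech and write it to `output.wav`:

    from paddlespeech.cli.tts import TTSExecutor

    tts_executor = TTSExecutor()
    wav_file = tts_executor(
        # Input text mixing Chinese and English.
        text="热烈欢迎您在 Discussions 中提交问题，并在 Issues 中指出发现的 bug。此外，我们非常希望您参与到 Paddle Speech 的开发中！",
        output='output.wav',    # path of the synthesized audio file
        am='fastspeech2_mix',   # acoustic model
        voc='hifigan_csmsc',    # vocoder
        lang='mix',             # mixed Chinese and English
        spk_id=174)             # speaker ID

Play the result in the notebook:

    import IPython.display as dp
    dp.Audio('output.wav')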
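
To sanity-check the synthesized file outside a notebook, the following sketch reads its header with Python's standard `wave` module. It assumes `output.wav` is an uncompressed PCM WAV file, which is the only format `wave` reads. The helper name `wav_info` is hypothetical.

```python
import wave
from typing import Tuple


def wav_info(path: str) -> Tuple[int, int, int, float]:
    """Return (sample_rate_hz, channels, bits_per_sample, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        return rate, wav.getnchannels(), bits, frames / rate
```

For example, `wav_info('output.wav')` returns the sample rate, channel count, bit depth, and duration of the audio written by the cell above.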

## 3.2 Model Training
- [Train FastSpeech2 with the CSMSC dataset](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)
- [Train HiFiGAN with the CSMSC dataset](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)
# 4. Model Principles


### 4.1 Acoustic Model: FastSpeech2
We use `FastSpeech2` as the acoustic model.
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/6b6d671713ec4d20a0e60653c7a5d4ae3c35b1d1e58b4cc39e0bc82ad4a341d9"></center>
<br><center> FastSpeech2 Model Structure</center></br>


The FastSpeech2 in PaddleSpeech TTS differs from the paper: it uses phone-level `pitch` and `energy` (closer to FastPitch), which makes the synthesized audio more **stable**.
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/862c21456c784c41a83a308b7d9707f0810cc3b3c6f94ed48c60f5d32d0072f0"></center>
<br><center> FastPitch Model Structure</center></br>
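
One common way to obtain phone-level values is to average the frame-level values over the frames assigned to each phone. The sketch below illustrates that idea under stated assumptions: durations are given in frames, and zero-valued frames are treated as unvoiced and skipped, which suits pitch. The helper name `average_by_duration` is hypothetical, and this is not necessarily how PaddleSpeech implements it.

```python
from typing import List


def average_by_duration(frame_values: List[float], durations: List[int]) -> List[float]:
    """Collapse frame-level values to one value per phone.

    durations[i] is the number of frames aligned to phone i. Zero-valued
    frames are ignored (assumed unvoiced); a phone with no non-zero frames
    gets 0.0.
    """
    if sum(durations) != len(frame_values):
        raise ValueError("durations must sum to the number of frames")
    phone_values = []
    start = 0
    for duration in durations:
        segment = frame_values[start:start + duration]
        voiced = [value for value in segment if value != 0.0]
        phone_values.append(sum(voiced) / len(voiced) if voiced else 0.0)
        start += duration
    return phone_values


# Seven frames of F0 (Hz) aligned to three phones of 3, 2, and 2 frames.
f0 = [0.0, 100.0, 110.0, 120.0, 200.0, 0.0, 0.0]
phone_f0 = average_by_duration(f0, [3, 2, 2])
```

With one value per phone, the predictor has fewer and smoother targets than with frame-level contours, which is consistent with the stability benefit noted above.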

### 4.2 Vocoder: HiFiGAN
We use `HiFiGAN` as the vocoder.

1. It introduces the Multi-Period Discriminator (MPD). HiFiGAN uses both a Multi-Scale Discriminator (MSD) and an MPD, with the goal of making the GAN discriminators as good as possible at telling synthetic audio from real audio.
2. The generator introduces a Multi-Receptive Field Fusion (MRF) module. To enlarge the receptive field, WaveNet stacks dilated convolutions and generates audio sample by sample. The sound quality is very good, but the model is larger and inference is slower. HiFiGAN proposes a residual structure that alternates dilated and vanilla convolutions to enlarge the receptive field, preserving synthesis quality while improving inference speed.

<img width="1054" alt="hifigan" src="https://user-images.githubusercontent.com/24568452/200246150-bad56215-a1ce-4536-9230-bbadc0ce57b6.png">

<br><center> HiFiGAN Model Structure</center></br>
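
The effect of dilation on the receptive field can be checked with a short calculation. For a stack of stride-1 one-dimensional convolutions, the receptive field is 1 plus the sum of (kernel size - 1) × dilation over the layers. The kernel size and dilation values below are illustrative assumptions, not a statement of the configuration PaddleSpeech ships.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field of stacked stride-1 1-D convolutions.

    Each layer is (kernel_size, dilation) and widens the field by
    (kernel_size - 1) * dilation samples.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Alternating dilated and vanilla (dilation 1) convolutions, kernel size 3.
alternating_stack = [(3, 1), (3, 1), (3, 3), (3, 1), (3, 5), (3, 1)]
# The same number of layers with no dilation at all.
vanilla_stack = [(3, 1)] * 6

alternating_field = receptive_field(alternating_stack)
vanilla_field = receptive_field(vanilla_stack)
```

With the same six layers, the alternating stack covers 25 samples while the undilated stack covers 13, so dilation widens the context without adding parameters.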


# 5. Attention
# 6. Related papers and citations
```text
@article{ren2020fastspeech,
  title={Fastspeech 2: Fast and high-quality end-to-end text to speech},
  author={Ren, Yi and Hu, Chenxu and Tan, Xu and Qin, Tao and Zhao, Sheng and Zhao, Zhou and Liu, Tie-Yan},
  journal={arXiv preprint arXiv:2006.04558},
  year={2020}
}

@article{kong2020hifi,
  title={Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis},
  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  pages={17022--17033},
  year={2020}
}
```