# TransformerTTS
PaddlePaddle dynamic graph implementation of TransformerTTS, a Transformer-based neural TTS model. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
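
The training and synthesis commands below read the dataset location from the shell variable `DATAPATH`. For example, assuming the archive was extracted in the current directory:

```bash
export DATAPATH=./LJSpeech-1.1
```
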
## Model Architecture
<div align="center" name="TransformerTTS model architecture">
  <img src="./images/model_architecture.jpg" width=400 height=600 /> <br>
</div>
<div align="center" >
TransformerTTS model architecture
</div>

The model replaces the RNN structures and the original attention mechanism of [Tacotron2](https://arxiv.org/abs/1712.05884) with multi-head attention. It consists of two main parts: an encoder and a decoder. We also implement the CBHG model of Tacotron as the vocoder and use the Griffin-Lim algorithm to convert the spectrogram into a raw waveform.

## Project Structure
```text
├── config                 # yaml configuration files
├── data.py                # dataset and dataloader settings for LJSpeech
├── synthesis.py           # script to synthesize waveform from text
├── train_transformer.py   # script for transformer model training
├── train_vocoder.py       # script for vocoder model training
```
## Saving & Loading
`train_transformer.py` and `train_vocoder.py` have three arguments in common: `--checkpoint`, `--iteration` and `--output`.

1. `--output` is the directory for saving results.
During training, checkpoints are saved in `checkpoints/` inside the output directory, and the TensorBoard log is saved in `log/`.
During synthesis, results are saved in `samples/` inside the output directory, and the TensorBoard log is saved in `log/`.

2. `--checkpoint` and `--iteration` are used to load from an existing checkpoint; a minimal sketch is given below. Loading follows these rules:
If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.
If `--checkpoint` is not provided, we try to load the checkpoint specified by `--iteration` from the checkpoint directory. If `--iteration` is not provided either, we try to load the latest checkpoint from the checkpoint directory.
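
For example, here is a minimal sketch of resuming transformer training. The flags follow the rules above; the checkpoint path and iteration number are purely illustrative, assuming checkpoints are named `step-<iteration>` as in the synthesis example below.

```bash
# Load an explicit checkpoint file (illustrative path).
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
--checkpoint='./experiment/checkpoints/step-100000'

# Or let the script look up a given iteration in the checkpoint directory.
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
--iteration=100000
```
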

## Train Transformer

The TransformerTTS model can be trained with ``train_transformer.py``.
```bash
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
```
Or you can run the script file directly.
```bash
sh train_transformer.sh
```
If you want to train on multiple GPUs, start training as follows:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
```

If you wish to resume from an existing checkpoint, see [Saving & Loading](#saving--loading) for details of checkpoint loading.

**Note: To ensure good training results, we recommend using multi-GPU training to enlarge the batch size, with at least 16 samples per batch on each GPU.**

For more help on arguments:
``python train_transformer.py --help``.

## Train Vocoder
The vocoder model can be trained with ``train_vocoder.py``.
```bash
python train_vocoder.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./vocoder' \
--config='configs/ljspeech.yaml'
```
Or you can run the script file directly.
```bash
sh train_vocoder.sh
```
If you want to train on multiple GPUs, start training as follows:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_vocoder.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./vocoder' \
--config='configs/ljspeech.yaml'
```
If you wish to resume from an existing checkpoint, see [Saving & Loading](#saving--loading) for details of checkpoint loading.

For more help on arguments:
``python train_vocoder.py --help``.

## Synthesis
After training the TransformerTTS and vocoder models, audio can be synthesized with ``synthesis.py``.
```bash
python synthesis.py \
--max_len=300 \
--use_gpu=1 \
--output='./synthesis' \
--config='configs/ljspeech.yaml' \
--checkpoint_transformer='./checkpoint/transformer/step-120000' \
--checkpoint_vocoder='./checkpoint/vocoder/step-100000'
```

Or you can run the script file directly.
```bash
sh synthesis.sh
```

For more help on arguments:
``python synthesis.py --help``.