# TransformerTTS

PaddlePaddle dynamic graph implementation of TransformerTTS, a neural TTS with Transformer. The implementation is based on [Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
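
The training and synthesis commands below read the dataset through `--data=${DATAPATH}`. A minimal sketch of setting it, assuming the archive was extracted in the current directory:

```bash
# Illustrative: the tarball above extracts to ./LJSpeech-1.1.
export DATAPATH=./LJSpeech-1.1
```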

## Model Architecture

<div align="center" name="TransformerTTS model architecture">
  <img src="./images/model_architecture.jpg" width=400 height=600 /> <br>
</div>
<div align="center" >
TransformerTTS model architecture
</div>

The model replaces the RNN structures and the original attention mechanism of [Tacotron2](https://arxiv.org/abs/1712.05884) with the multi-head attention mechanism. It consists of two main parts, an encoder and a decoder. We also implement the CBHG model of Tacotron as the vocoder and convert the spectrogram into a raw waveform using the Griffin-Lim algorithm.
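
For reference, the core operation in each multi-head attention block is scaled dot-product attention, written here in the standard Transformer notation (these symbols are not project-specific):

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$ and $V$ are the query, key and value matrices and $d_k$ is the key dimension.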

## Project Structure

```text
├── config                 # yaml configuration files
├── data.py                # dataset and dataloader settings for LJSpeech
├── synthesis.py           # script to synthesize waveform from text
├── train_transformer.py   # script for transformer model training
├── train_vocoder.py       # script for vocoder model training
```

## Saving & Loading

`train_transformer.py` and `train_vocoder.py` have three arguments in common: `--checkpoint`, `--iteration` and `--output`.

1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and tensorboard logs are saved in `${output}/log`.

2. `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way (a sketch of both modes follows this list).

    - If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.

    - If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded.

    - If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.
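
A minimal sketch of both loading modes for `train_transformer.py`, assuming a checkpoint was saved at step 120000 (the step number and paths are illustrative):

```bash
# Load an explicit checkpoint by path (illustrative path).
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
--checkpoint='./experiment/checkpoints/step-120000'

# Or load by target step from ${output}/checkpoints/.
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml' \
--iteration=120000
```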

## Train Transformer

The TransformerTTS model can be trained by running ``train_transformer.py``.

```bash
python train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
```

Or you can run the script file directly.

```bash
sh train_transformer.sh
```

If you want to train on multiple GPUs, you must start training in the following way.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_transformer.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./experiment' \
--config='configs/ljspeech.yaml'
```

If you wish to resume training from an existing model, see [Saving & Loading](#saving--loading) for details of checkpoint loading.

**Note: To ensure good training quality, we recommend multi-GPU training to enlarge the effective batch size, with at least 16 samples per GPU in each batch.**

For more help on arguments, run ``python train_transformer.py --help``.

## Train Vocoder

The vocoder model can be trained by running ``train_vocoder.py``.

```bash
python train_vocoder.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./vocoder' \
--config='configs/ljspeech.yaml'
```

Or you can run the script file directly.

```bash
sh train_vocoder.sh
```

If you want to train on multiple GPUs, you must start training in the following way.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train_vocoder.py \
--use_gpu=1 \
--data=${DATAPATH} \
--output='./vocoder' \
--config='configs/ljspeech.yaml'
```

If you wish to resume training from an existing model, see [Saving & Loading](#saving--loading) for details of checkpoint loading.

For more help on arguments, run ``python train_vocoder.py --help``.

## Synthesis

After training the TransformerTTS and vocoder models, audio can be synthesized by running ``synthesis.py``.

```bash
python synthesis.py \
--max_len=300 \
--use_gpu=1 \
--output='./synthesis' \
--config='configs/ljspeech.yaml' \
--checkpoint_transformer='./checkpoint/transformer/step-120000' \
--checkpoint_vocoder='./checkpoint/vocoder/step-100000'
```

Or you can run the script file directly.

```bash
sh synthesis.sh
```

For more help on arguments, run ``python synthesis.py --help``.