# FastSpeech

PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on Transformer. The implementation is based on [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).

## Dataset

We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```

## Model Architecture

![FastSpeech model architecture](./images/model_architecture.png)

FastSpeech is a feed-forward network based on Transformer, rather than the conventional encoder-attention-decoder architecture. The model extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. We use TransformerTTS as the teacher model. The model consists of three parts: an encoder, a decoder and a length regulator.
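
The core of this design is the length regulator. The sketch below is a simplified, NumPy-only illustration of how it expands phoneme-level encoder outputs according to per-phoneme durations; it is not the implementation used in this repository.

```python
import numpy as np

def length_regulator(encoder_outputs, durations):
    # encoder_outputs: [num_phonemes, hidden_dim] phoneme-level hidden states
    # durations: [num_phonemes] integer number of mel frames per phoneme
    # Repeat each phoneme's hidden state durations[i] times, so the expanded
    # sequence has the same length as the target mel-spectrogram.
    return np.repeat(encoder_outputs, durations, axis=0)

# Example: 3 phonemes with durations 2, 3 and 1 expand to 6 frames.
hidden = np.random.randn(3, 256)
expanded = length_regulator(hidden, np.array([2, 3, 1]))
assert expanded.shape == (6, 256)
```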

## Project Structure

```text
├── config                 # yaml configuration files
├── synthesis.py           # script to synthesize waveform from text
├── train.py               # script for model training
```

## Saving & Loading

`train.py` and `synthesis.py` have 3 arguments in common, `--checkpoint`, `--iteration` and `--output`.

1. `--output` is the directory for saving results.
During training, checkpoints are saved in `${output}/checkpoints` and tensorboard logs are saved in `${output}/log`.
During synthesis, results are saved in `${output}/samples` and the tensorboard log is saved in `${output}/log`.

2.  `--checkpoint` is the path of a checkpoint and `--iteration` is the target step. They are used to load checkpoints in the following way.

    - If `--checkpoint` is provided, the checkpoint specified by `--checkpoint` is loaded.

    - If `--checkpoint` is not provided, we try to load the checkpoint of the target step specified by `--iteration` from the `${output}/checkpoints/` directory, e.g. if given `--iteration 120000`, the checkpoint `${output}/checkpoints/step-120000.*` will be loaded (see the example after this list).

    - If neither `--checkpoint` nor `--iteration` is provided, we try to load the latest checkpoint from the `${output}/checkpoints/` directory.
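
For example, to resume training from the checkpoint saved at step 120000 in a previous run, you could combine `--iteration` with the regular training flags (a sketch reusing the flags of the training command shown later in this README):

```bash
python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--config='configs/ljspeech.yaml' \
--output=${OUTPUTPATH} \
--iteration=120000
```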

## Compute Phoneme Duration

A ground truth duration of each phoneme (number of frames in the spectrogram that correspond to that phoneme) should be provided when training a FastSpeech model.

We compute the ground truth duration of each phoneme in the following way.
We extract the encoder-decoder attention alignment from a trained TransformerTTS model; each frame is considered to correspond to the phoneme that receives the most attention.
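
Conceptually, converting an attention matrix into durations can be done as in the following NumPy sketch (the `[num_frames, num_phonemes]` shape convention and head-averaging are assumptions; the actual extraction is done by the script below):

```python
import numpy as np

def alignment_to_durations(attn, num_phonemes):
    # attn: [num_frames, num_phonemes] encoder-decoder attention weights
    # (e.g. averaged over heads) from a trained TransformerTTS model.
    frame_to_phoneme = np.argmax(attn, axis=1)      # best-attended phoneme per frame
    # A phoneme's duration is the number of frames assigned to it.
    return np.bincount(frame_to_phoneme, minlength=num_phonemes)
```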

You can run `alignments/get_alignments.py` to generate the alignments.

```bash
cd alignments
python get_alignments.py \
--use_gpu=1 \
--output='./alignments' \
--data=${DATAPATH} \
--config=${CONFIG} \
--checkpoint_transformer=${CHECKPOINT}
```

where `${DATAPATH}` is the path of the LJSpeech dataset, `${CHECKPOINT}` is the path of a pretrained TransformerTTS model, and `${CONFIG}` is the yaml config file of that TransformerTTS checkpoint. You need to prepare a pre-trained TransformerTTS checkpoint before running this script.

For more help on arguments:

``python get_alignments.py --help``.

Alternatively, you can use your own phoneme durations; you just need to save the data in the following format.

```text
{'fname1': alignment1,
 'fname2': alignment2,
 ...}
```
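
For instance, a minimal sketch that writes such a mapping to disk (the pickle serialization and the `alignments.pkl` file name are assumptions here; check how `--alignments_path` is consumed by `train.py` for the exact format it expects):

```python
import pickle
import numpy as np

# One integer duration (in mel frames) per phoneme, keyed by utterance name.
alignments = {
    'fname1': np.array([3, 5, 4, 7], dtype=np.int64),
    'fname2': np.array([2, 6, 5, 3], dtype=np.int64),
}

with open('alignments.pkl', 'wb') as f:
    pickle.dump(alignments, f)
```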

## Train FastSpeech

The FastSpeech model can be trained by running ``train.py``.

```bash
python train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml'
```

Or you can run the script file directly.

```bash
sh train.sh
```

If you want to train on multiple GPUs, start training in the following way.

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--data=${DATAPATH} \
--alignments_path=${ALIGNMENTS_PATH} \
--output=${OUTPUTPATH} \
--config='configs/ljspeech.yaml'
```

If you wish to resume from an existing model, see [Saving & Loading](#saving--loading) for details of checkpoint loading.

For more help on arguments:

``python train.py --help``.

## Synthesis

After training the FastSpeech model, audio can be synthesized by running ``synthesis.py``.

```bash
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint=${CHECKPOINTPATH} \
--config='configs/ljspeech.yaml' \
--output=${OUTPUTPATH} \
--vocoder='griffin-lim'
```

We currently support two vocoders, the Griffin-Lim algorithm and WaveFlow. You can set ``--vocoder`` to use either of them. If you want to use WaveFlow as your vocoder, you need to set ``--config_vocoder`` and ``--checkpoint_vocoder``, which are the paths of the vocoder's config file and checkpoint respectively. You can download a pre-trained WaveFlow model from [here](https://github.com/PaddlePaddle/Parakeet#vocoders).
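
For example, a WaveFlow-based synthesis command might look like the following sketch (the `--vocoder='waveflow'` value and the `${VOCODER_CONFIG}`/`${VOCODER_CHECKPOINT}` placeholders are assumptions; run ``python synthesis.py --help`` to confirm the accepted values):

```bash
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint=${CHECKPOINTPATH} \
--config='configs/ljspeech.yaml' \
--output=${OUTPUTPATH} \
--vocoder='waveflow' \
--config_vocoder=${VOCODER_CONFIG} \
--checkpoint_vocoder=${VOCODER_CHECKPOINT}
```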

Or you can run the script file directly.

```bash
sh synthesis.sh
```

For more help on arguments:

``python synthesis.py --help``.

Then you can find the synthesized audio files in ``${OUTPUTPATH}/samples``.