# FastSpeech
PaddlePaddle dynamic graph implementation of FastSpeech, a feed-forward network based on the Transformer. The implementation follows [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263).

## Dataset

We experiment with the LJSpeech dataset. Download and extract [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
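
After extraction you should have the transcripts and audio clips in a single directory. As a quick sanity check (the layout shown is that of the standard LJSpeech-1.1 release):

```bash
ls LJSpeech-1.1
# metadata.csv  README  wavs
```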

## Model Architecture

![FastSpeech model architecture](./images/model_architecture.png)

FastSpeech is a feed-forward network based on the Transformer, in place of the usual encoder-attention-decoder architecture. It extracts attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, and a length regulator uses the predicted durations to expand the source phoneme sequence to the length of the target mel-spectrogram sequence, so that the mel-spectrogram can be generated in parallel. For example, given phoneme hidden states [h1, h2, h3] with predicted durations [2, 1, 3], the length regulator outputs [h1, h1, h2, h3, h3, h3], a sequence of mel-spectrogram length 6. We use TransformerTTS as the teacher model. The model consists of three parts: an encoder, a decoder and a length regulator.

## Project Structure
```text
├── config                 # yaml configuration files
├── synthesis.py           # script to synthesize waveform from text
├── train.py               # script for model training
```

## Train FastSpeech

The FastSpeech model can be trained with ``train.py``.
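
The training command below reads the dataset path from ``${DATAPATH}``, which is not set by the script; point it at the extracted dataset first, for example:

```bash
# Assumed location of the dataset extracted in the step above:
export DATAPATH=./LJSpeech-1.1
```
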
```bash
python train.py \
--use_gpu=1 \
--use_data_parallel=0 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'
```
Or you can run the script file directly.
```bash
sh train.sh
```
If you want to train on multiple GPUs, you must set ``--use_data_parallel=1``, and then start training as follows:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m paddle.distributed.launch --selected_gpus=0,1,2,3 --log_dir ./mylog train.py \
--use_gpu=1 \
--use_data_parallel=1 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--config_path='config/fastspeech.yaml'
```
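
``paddle.distributed.launch`` writes one log file per worker under the directory given by ``--log_dir``, so you can follow a single worker's progress with, for example:

```bash
# The file name is the launcher's default and may differ across versions:
tail -f ./mylog/workerlog.0
```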

If you wish to resume from an existing model, please set ``--checkpoint_path`` and ``--fastspeech_step``.
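
For example, a resume run might look like the following, where the checkpoint directory and step number are placeholders for your own training state:

```bash
# Resume single-GPU training from a saved checkpoint
# (checkpoint/ and step 112000 are hypothetical values):
python train.py \
--use_gpu=1 \
--use_data_parallel=0 \
--data_path=${DATAPATH} \
--transtts_path='../transformer_tts/checkpoint' \
--transformer_step=160000 \
--checkpoint_path='checkpoint/' \
--fastspeech_step=112000 \
--config_path='config/fastspeech.yaml'
```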

For more help on arguments:
``python train.py --help``.

## Synthesis
After training FastSpeech, audio can be synthesized with ``synthesis.py``.
```bash
python synthesis.py \
--use_gpu=1 \
--alpha=1.0 \
--checkpoint_path='checkpoint/' \
--fastspeech_step=112000
```

Or you can run the script file directly.
```bash
sh synthesis.sh
```
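
``--alpha`` is the length-regulator scaling factor from the FastSpeech paper: it multiplies the predicted phoneme durations, so values below 1.0 produce faster speech and values above 1.0 produce slower speech. Assuming this script follows that convention, a faster sample could be synthesized with:

```bash
# alpha < 1.0 shortens the predicted durations, speeding up the speech:
python synthesis.py \
--use_gpu=1 \
--alpha=0.8 \
--checkpoint_path='checkpoint/' \
--fastspeech_step=112000
```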

For more help on arguments:
``python synthesis.py --help``.