# FastSpeech2 with LJSpeech-1.1
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).

## Dataset
### Download and Extract
Download LJSpeech-1.1 from its [Official Website](https://keithito.com/LJ-Speech-Dataset/) and extract it to `~/datasets`. The dataset will then be in the directory `~/datasets/LJSpeech-1.1`.
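If you prefer the command line, here is a minimal sketch (the archive URL is taken from the official website and may change):
```bash
mkdir -p ~/datasets
# Download and extract LJSpeech-1.1 into ~/datasets/LJSpeech-1.1.
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -P ~/datasets
tar xjf ~/datasets/LJSpeech-1.1.tar.bz2 -C ~/datasets
```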

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for FastSpeech2.
You can download the alignments from [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
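For example, a minimal sketch that downloads the alignments and extracts them to `./ljspeech_alignment` (the path assumed in the next section; adjust if the archive layout differs):
```bash
wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz
mkdir -p ljspeech_alignment
tar xzf ljspeech_alignment.tar.gz -C ljspeech_alignment
```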

## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from text file.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
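You can also run several consecutive stages, e.g. preprocessing followed by training (assuming the stage numbering in `run.sh` follows the list above):
```bash
./run.sh --stage 0 --stop-stage 1
```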
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    └── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch, and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set and located in `dump/train/*_stats.npy`.
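If you want to inspect these statistics, a quick sketch (assuming the `*_stats.npy` files load as plain NumPy arrays):
```bash
# Print the normalization statistics computed from the training set.
python3 -c "import numpy as np; print(np.load('dump/train/speech_stats.npy'))"
```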

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text lengths, speech lengths, durations, the paths of speech, pitch, and energy features, the speaker, and the id of each utterance.
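For illustration only, an entry might look like the line below; the field names and values here are hypothetical, so check your own `dump/*/norm/metadata.jsonl` for the real schema.
```text
{"utt_id": "LJ001-0001", "phones": [...], "text_lengths": 21, "speech_lengths": 316, "durations": [...], "speech": ".../speech.npy", "pitch": ".../pitch.npy", "energy": ".../energy.npy", "spk_id": 0}
```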

### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
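For example, with the config and output path used elsewhere in this example (both values are assumptions; adjust to your setup):
```bash
CUDA_VISIBLE_DEVICES=0,1 ./local/train.sh conf/default.yaml exp/default
```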
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]
                [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]

Train a FastSpeech2 model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       fastspeech2 config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu=0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --speaker-dict SPEAKER_DICT
                        speaker id map file for multiple speaker model.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of GPUs to use; if ngpu == 0, the CPU is used.
5. `--phones-dict` is the path of the phone vocabulary file.
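Putting these arguments together, `./local/train.sh` boils down to roughly the following call; this is a sketch assuming the default paths produced by the preprocessing stage:
```bash
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=2 \
    --phones-dict=dump/phone_id_map.txt
```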

### Synthesizing
We use [Parallel WaveGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1) as the neural vocoder.
Download the pretrained Parallel WaveGAN model from [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) and unzip it.
```bash
unzip pwg_ljspeech_ckpt_0.5.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_ljspeech_ckpt_0.5
├── pwg_default.yaml              # default config used to train parallel wavegan
├── pwg_snapshot_iter_400000.pdz  # generator parameters of parallel wavegan
└── pwg_stats.npy                 # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
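For example, with the checkpoint produced by training above (the checkpoint name is an assumption; use the one in `${train_output_path}/checkpoints`):
```bash
CUDA_VISIBLE_DEVICES=0 ./local/synthesize.sh conf/default.yaml exp/default snapshot_iter_100000.pdz
```
The complete help message of `synthesize.py` follows.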
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
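Expanded with the arguments above, the call made by `./local/synthesize.sh` looks roughly like this (a sketch; the paths assume the default layout of this example):
```bash
python3 ${BIN_DIR}/../synthesize.py \
    --am=fastspeech2_ljspeech \
    --am_config=conf/default.yaml \
    --am_ckpt=exp/default/checkpoints/snapshot_iter_100000.pdz \
    --am_stat=dump/train/speech_stats.npy \
    --phones_dict=dump/phone_id_map.txt \
    --voc=pwgan_ljspeech \
    --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
    --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
    --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
    --test_metadata=dump/test/norm/metadata.jsonl \
    --output_dir=exp/default/test
```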
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--am` is the acoustic model type, with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the FastSpeech2 pretrained model.
3. `--voc` is the vocoder type, with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the Parallel WaveGAN pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize (see the example after this list).
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of GPUs to use; if ngpu == 0, the CPU is used.
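As referenced in note 7, the `--text` file has one `utt_id sentence` pair per line; hypothetical example lines:
```text
001 Life was like a box of chocolates.
002 You never know what you're gonna get.
```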

## Pretrained Model
Pretrained FastSpeech2 model with no silence at the edges of audios:
- [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)

The static model can be downloaded here:
- [fastspeech2_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_static_1.1.0.zip)

The ONNX model can be downloaded here:
- [fastspeech2_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip)

The Paddle-Lite model can be downloaded here:
- [fastspeech2_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_pdlite_1.3.0.zip)

Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss | eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 100000| 1.505682| 0.612104| 0.045505| 0.62792| 0.220147

The FastSpeech2 checkpoint contains the files listed below.
```text
fastspeech2_nosil_ljspeech_ckpt_0.5
├── default.yaml             # default config used to train fastspeech2
├── phone_id_map.txt         # phone vocabulary file when training fastspeech2
├── snapshot_iter_100000.pdz # model parameters and optimizer states
└── speech_stats.npy         # statistics used to normalize spectrogram when training fastspeech2
```
You can use the following script to synthesize speech for `${BIN_DIR}/../../assets/sentences_en.txt` using the pretrained FastSpeech2 and Parallel WaveGAN models.
```bash
source path.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=fastspeech2_ljspeech \
  --am_config=fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
  --am_ckpt=fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
  --am_stat=fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
  --voc=pwgan_ljspeech \
  --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
  --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
  --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
  --lang=en \
  --text=${BIN_DIR}/../../assets/sentences_en.txt \
  --output_dir=exp/default/test_e2e \
  --inference_dir=exp/default/inference \
  --phones_dict=fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
```