# FastSpeech2 with VCTK
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443).

## Dataset
### Download and Extract the Dataset
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443).
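For reference, here is a minimal sketch of downloading and extracting the corpus to the path assumed in the Get Started section below (the direct download URL is an assumption based on the datashare page; adjust the paths to your setup):
```bash
# Download VCTK-0.92 (~11 GB) and extract it to ~/datasets/VCTK-Corpus-0.92.
mkdir -p ~/datasets/VCTK-Corpus-0.92
wget https://datashare.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip
unzip VCTK-Corpus-0.92.zip -d ~/datasets/VCTK-Corpus-0.92
```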

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phoneme durations for FastSpeech2.
You can download the precomputed alignments from [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
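For example, a minimal sketch of fetching and extracting the precomputed alignments (assuming the archive unpacks to a top-level `vctk_alignment` folder, which is the path assumed in the Get Started section below):
```bash
# Download the precomputed MFA alignments and extract them into ./vctk_alignment.
wget https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz
tar xzvf vctk_alignment.tar.gz
```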
Note: we removed three speakers from VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because it has no text transcriptions.
2. `p280` and `p362`, because there are no `*_mic2.flac` files (which have better quality than `*_mic1.flac`) for them.

## Get Started
Assume the path to the dataset is `~/datasets/VCTK-Corpus-0.92`.
Assume the path to the MFA result of VCTK is `./vctk_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveforms from `metadata.jsonl`.
    - synthesize waveforms from a text file.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    └── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the speech, pitch, and energy features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and located in `dump/train/*_stats.npy`.
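If you want to sanity-check these statistics, here is a quick sketch (assuming `numpy` is installed in your Python environment):
```bash
# Print the mean and standard deviation used to normalize speech features.
python3 -c "import numpy as np; print(np.load('dump/train/speech_stats.npy'))"
```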

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text lengths, speech lengths, durations, the paths of the speech, pitch, and energy features, the speaker, and the id of each utterance.
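Since each line of `metadata.jsonl` is a single JSON object, you can pretty-print one record to inspect these fields, for example:
```bash
# Pretty-print the first record of the normalized training metadata.
head -n 1 dump/train/norm/metadata.jsonl | python3 -m json.tool
```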

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]
                [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]

Train a FastSpeech2 model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       fastspeech2 config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu=0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --speaker-dict SPEAKER_DICT
                        speaker id map file for multiple speaker model.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--phones-dict` is the path of the phone vocabulary file.
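Putting these arguments together, a direct invocation could look like the sketch below (run `source path.sh` first so that `${BIN_DIR}` is defined; `./local/train.sh` wires up the same flags for you):
```bash
# Train FastSpeech2 on one GPU with the default config and the dicts produced by preprocessing.
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1 \
    --phones-dict=dump/phone_id_map.txt \
    --speaker-dict=dump/speaker_id_map.txt
```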

### Synthesizing
We use [Parallel WaveGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1) as the neural vocoder.

Download the pretrained Parallel WaveGAN model from [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip) and unzip it.
```bash
unzip pwg_vctk_ckpt_0.1.1.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_vctk_ckpt_0.1.1
├── default.yaml                   # default config used to train parallel wavegan
├── snapshot_iter_1500000.pdz      # generator parameters of parallel wavegan
└── feats_stats.npy                # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use default config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use default config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
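For reference, a direct invocation of `synthesize.py` might look like the sketch below (paths assume the pretrained models from the sections below have been downloaded and unzipped into the current directory, and that `source path.sh` has been run):
```bash
# Synthesize waveforms from dump/test/norm/metadata.jsonl with pretrained models.
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
  --am=fastspeech2_vctk \
  --am_config=fastspeech2_nosil_vctk_ckpt_0.5/default.yaml \
  --am_ckpt=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \
  --am_stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \
  --voc=pwgan_vctk \
  --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
  --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
  --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
  --phones_dict=fastspeech2_nosil_vctk_ckpt_0.5/phone_id_map.txt \
  --speaker_dict=fastspeech2_nosil_vctk_ckpt_0.5/speaker_id_map.txt \
  --test_metadata=dump/test/norm/metadata.jsonl \
  --output_dir=exp/default/test
```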
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveforms from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use default config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use default config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--am` is the acoustic model type with the format `{model_name}_{dataset}`.
2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict`, and `--speaker_dict` are arguments for the acoustic model, which correspond to the 5 files in the FastSpeech2 pretrained model.
3. `--voc` is the vocoder type with the format `{model_name}_{dataset}`.
4. `--voc_config`, `--voc_ckpt`, and `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the Parallel WaveGAN pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save the synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
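The `--text` file puts one `utt_id sentence` pair per line; hypothetical contents illustrating the format:
```text
001 The quick brown fox jumps over the lazy dog.
002 Hello world, this is a test sentence.
```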

## Pretrained Model
A pretrained FastSpeech2 model with no silence at the edges of audio can be downloaded here: [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)

The FastSpeech2 checkpoint contains the files listed below.
```text
fastspeech2_nosil_vctk_ckpt_0.5
├── default.yaml            # default config used to train fastspeech2
├── phone_id_map.txt        # phone vocabulary file when training fastspeech2
├── snapshot_iter_66200.pdz # model parameters and optimizer states
├── speaker_id_map.txt      # speaker id map file when training a multi-speaker fastspeech2
└── speech_stats.npy        # statistics used to normalize spectrogram when training fastspeech2
```
You can use the following scripts to synthesize sentences from `${BIN_DIR}/../sentences_en.txt` using the pretrained FastSpeech2 and Parallel WaveGAN models.
```bash
source path.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=fastspeech2_vctk \
  --am_config=fastspeech2_nosil_vctk_ckpt_0.5/default.yaml \
  --am_ckpt=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \
  --am_stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \
  --voc=pwgan_vctk \
  --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml  \
  --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
  --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
  --lang=en \
  --text=${BIN_DIR}/../sentences_en.txt \
  --output_dir=exp/default/test_e2e \
  --phones_dict=fastspeech2_nosil_vctk_ckpt_0.5/phone_id_map.txt \
  --speaker_dict=fastspeech2_nosil_vctk_ckpt_0.5/speaker_id_map.txt \
  --spk_id=0 \
  --inference_dir=exp/default/inference
```