# FastSpeech2 with AISHELL-3
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).

AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that can be used to train multi-speaker Text-to-Speech (TTS) systems.

We use AISHELL-3 to train a multi-speaker FastSpeech2 model here.
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA 2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download the alignment results from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which currently uses MFA 1.x) in our repo.
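
For example, you can fetch and unpack the alignment results like this (a sketch, assuming the archive contents belong directly under `./aishell3_alignment_tone`; adjust the target directory if the tarball already contains a top-level folder):
```bash
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz
mkdir -p aishell3_alignment_tone
tar zxvf aishell3_alignment_tone.tar.gz -C aishell3_alignment_tone
```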

## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from text file.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
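
You can also run any contiguous range of stages. For example, assuming stage 1 is training and stage 2 is synthesis in this example's `run.sh`, the following command skips preprocessing and runs training and synthesis only:
```bash
./run.sh --stage 1 --stop-stage 2
```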

### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    └── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the speech, pitch, and energy features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and stored in `dump/train/*_stats.npy`.
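
If you want to sanity-check these statistics, you can load them with NumPy (a quick check, assuming each `*_stats.npy` file stores the per-dimension mean and standard deviation computed by the preprocessing scripts):
```bash
python3 -c "import numpy as np; print(np.load('dump/train/speech_stats.npy').shape)"
```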

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, the speaker, and the id of each utterance.
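
You can peek at a single record to see these fields (a quick check, assuming each line of `metadata.jsonl` is one JSON object):
```bash
head -n 1 dump/train/norm/metadata.jsonl | python3 -m json.tool
```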

### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]
                [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]

Train a FastSpeech2 model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       fastspeech2 config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu=0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --speaker-dict SPEAKER_DICT
                        speaker id map file for multiple speaker model.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker FastSpeech2.
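
If you want to bypass the wrapper script, the flags above map onto a direct call like the following (a minimal sketch, assuming the `dump` layout produced by preprocessing and the default config; `./local/train.sh` passes essentially the same arguments):
```bash
python3 ${BIN_DIR}/train.py \
  --config=conf/default.yaml \
  --train-metadata=dump/train/norm/metadata.jsonl \
  --dev-metadata=dump/dev/norm/metadata.jsonl \
  --output-dir=exp/default \
  --ngpu=1 \
  --phones-dict=dump/phone_id_map.txt \
  --speaker-dict=dump/speaker_id_map.txt
```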

### Synthesizing
We use [Parallel WaveGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it.
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip
unzip pwg_aishell3_ckpt_0.5.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_aishell3_ckpt_0.5
├── default.yaml                   # default config used to train parallel wavegan
├── feats_stats.npy                # statistics used to normalize spectrogram when training parallel wavegan
└── snapshot_iter_1000000.pdz      # generator parameters of parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use default config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use default config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
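
The flags map onto a direct call in the same way (a sketch, assuming a trained checkpoint under `exp/default/checkpoints/` and the pretrained Parallel WaveGAN unzipped above; replace `${ckpt_name}` with the snapshot you want to test):
```bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
  --am=fastspeech2_aishell3 \
  --am_config=conf/default.yaml \
  --am_ckpt=exp/default/checkpoints/${ckpt_name} \
  --am_stat=dump/train/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --test_metadata=dump/test/norm/metadata.jsonl \
  --output_dir=exp/default/test \
  --phones_dict=dump/phone_id_map.txt \
  --speaker_dict=dump/speaker_id_map.txt
```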
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use default config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use default config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--am` is the acoustic model type with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict`, and `--speaker_dict` are arguments for the acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, and `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize (one 'utt_id sentence' pair per line; see the format example below).
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
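
For reference, the `--text` file is plain text with one 'utt_id sentence' pair per line. A hypothetical two-line input might look like this (utt_ids and sentences are made up for illustration):
```text
001 欢迎使用飞桨语音合成系统。
002 今天天气很好。
```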

## Pretrained Model
Pretrained FastSpeech2 model with no silence at the edges of audios: [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)

The FastSpeech2 checkpoint contains the files listed below.

```text
fastspeech2_nosil_aishell3_ckpt_0.4
├── default.yaml            # default config used to train fastspeech2
├── phone_id_map.txt        # phone vocabulary file when training fastspeech2
├── snapshot_iter_96400.pdz # model parameters and optimizer states
├── speaker_id_map.txt      # speaker id map file when training a multi-speaker fastspeech2
└── speech_stats.npy        # statistics used to normalize spectrogram when training fastspeech2
```
You can use the following script to synthesize sentences from `${BIN_DIR}/../sentences.txt` using the pretrained fastspeech2 and parallel wavegan models.
```bash
source path.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=fastspeech2_aishell3 \
  --am_config=fastspeech2_nosil_aishell3_ckpt_0.4/default.yaml \
  --am_ckpt=fastspeech2_nosil_aishell3_ckpt_0.4/snapshot_iter_96400.pdz \
  --am_stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --lang=zh \
  --text=${BIN_DIR}/../sentences.txt \
  --output_dir=exp/default/test_e2e \
  --phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
  --speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
  --spk_id=0

```