# SpeedySpeech with CSMSC
This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2008.03802) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html). NOTE that we only implement the student part of the SpeedySpeech model. The ground-truth alignment used to train the model is extracted from the dataset using [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner).

## Dataset
### Download and Extract
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SpeedySpeech.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
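If you use the precomputed alignment, a minimal sketch of fetching and unpacking it (run from this example's directory; we assume the archive unpacks to `./baker_alignment_tone`, the path used below):
```bash
# Fetch the precomputed MFA alignments (with tones) for CSMSC and unpack them.
wget https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz
tar xzf baker_alignment_tone.tar.gz
```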

## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from a text file.
5. inference using the static model.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to run only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
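Ranges work the same way. For example, assuming stage 1 is the training stage (following the step list above), the command below would preprocess the dataset and then train the model:
```bash
# Run stages 0 through 1: preprocessing followed by training (assumed stage layout).
./run.sh --stage 0 --stop-stage 1
```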
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
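For instance, a concrete invocation using the default config shipped with this example could be:
```bash
# Preprocess CSMSC with the default config; features are written into ./dump.
./local/preprocess.sh conf/default.yaml
```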
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── feats_stats.npy
```

The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrograms. The statistics used to normalize the spectrograms are computed from the training set and stored in `dump/train/feats_stats.npy`.

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, tones, durations, the path of the spectrogram, and the id of each utterance.
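To get a feel for the format, you can inspect a single record, for example:
```bash
# Print the first record of the normalized training metadata; fields include
# the utterance id, phones, tones, durations, and the spectrogram path.
head -n 1 dump/train/norm/metadata.jsonl
```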

### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
```
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--use-relative-path USE_RELATIVE_PATH]
                [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]

Train a Speedyspeech model with a single speaker dataset.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --use-relative-path USE_RELATIVE_PATH
                        whether use relative path in metadata
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --tones-dict TONES_DICT
                        tone vocabulary file.
```

1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--tones-dict` is the path of the tone vocabulary file.
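Putting these together, a direct invocation of the training script might look like the sketch below. The phone and tone dictionary paths are assumptions; use the ones produced by your preprocessing run.
```bash
# A sketch of calling train.py directly instead of via local/train.sh.
# The *_id_map.txt paths are assumed outputs of preprocessing.
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1 \
    --phones-dict=dump/phone_id_map.txt \
    --tones-dict=dump/tone_id_map.txt
```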

### Synthesizing
We use [Parallel WaveGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder.
Download the pretrained Parallel WaveGAN model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip pwg_baker_ckpt_0.4.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_baker_ckpt_0.4
├── pwg_default.yaml               # default config used to train parallel wavegan
├── pwg_snapshot_iter_400000.pdz   # model parameters of parallel wavegan
└── pwg_stats.npy                  # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
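For reference, a direct call to `synthesize.py` mirroring what `./local/synthesize.sh` does might look like the sketch below; the acoustic model checkpoint name and dictionary paths are assumptions, so substitute the ones from your own training run.
```bash
# A sketch of synthesizing from dump/test/norm/metadata.jsonl with a trained
# speedyspeech checkpoint and the pretrained parallel wavegan vocoder.
# snapshot_iter_11400.pdz and the *_id_map.txt paths are assumptions.
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --am=speedyspeech_csmsc \
    --am_config=conf/default.yaml \
    --am_ckpt=exp/default/checkpoints/snapshot_iter_11400.pdz \
    --am_stat=dump/train/feats_stats.npy \
    --phones_dict=dump/phone_id_map.txt \
    --tones_dict=dump/tone_id_map.txt \
    --voc=pwgan_csmsc \
    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
    --test_metadata=dump/test/norm/metadata.jsonl \
    --output_dir=exp/default/test \
    --ngpu=1
```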
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveforms from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--am` is the acoustic model type, with the format {model_name}_{dataset}
2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict` and `--tones_dict` are arguments for the acoustic model, which correspond to the 5 files in the speedyspeech pretrained model.
3. `--voc` is the vocoder type, with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_ckpt` and `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.

### Inferencing
After synthesizing, we will get static models of speedyspeech and pwgan in `${train_output_path}/inference`.
`./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for speedyspeech + pwgan synthesis.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
```
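For example, assuming the training output directory is `exp/default` (as in the examples above), the call would be:
```bash
# Run static-graph inference with the exported models under exp/default/inference.
CUDA_VISIBLE_DEVICES=0 ./local/inference.sh exp/default
```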

## Pretrained Model
Pretrained SpeedySpeech model with no silence at the edges of audio:
- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)

The static model can be downloaded here:
- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
- [speedyspeech_csmsc_static_2.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_2.0.0.zip)

Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
default | 1(gpu) x 11400 | 0.83655 | 0.42324 | 0.03211 | 0.38119
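To try the pretrained model, first download and unpack the checkpoint (a sketch, run from this example's directory):
```bash
# Fetch and unzip the pretrained SpeedySpeech checkpoint referenced above.
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip
unzip speedyspeech_nosil_baker_ckpt_0.5.zip
```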

The SpeedySpeech checkpoint contains the files listed below.
```text
speedyspeech_nosil_baker_ckpt_0.5
├── default.yaml            # default config used to train speedyspeech
├── feats_stats.npy         # statistics used to normalize spectrogram when training speedyspeech
├── phone_id_map.txt        # phone vocabulary file when training speedyspeech
├── snapshot_iter_11400.pdz # model parameters and optimizer states
└── tone_id_map.txt         # tone vocabulary file when training speedyspeech
```
You can use the following script to synthesize the sentences in `${BIN_DIR}/../sentences.txt` using the pretrained speedyspeech and parallel wavegan models.
```bash
source path.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=speedyspeech_csmsc \
  --am_config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
  --am_ckpt=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
  --am_stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
  --voc=pwgan_csmsc \
  --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --lang=zh \
  --text=${BIN_DIR}/../sentences.txt \
  --output_dir=exp/default/test_e2e \
  --inference_dir=exp/default/inference \
  --phones_dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
  --tones_dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
```