Unverified commit 19ef7210 authored by 小湉湉 and committed by GitHub

[TTS]Add hifigan (#1097)

* add hifigan

* add hifigan

* integrate synthesize synthesize_e2e, inference for tts, test=tts

* add some python files, test=tts

* update readme, test=doc_fix
Parent 675cff25
# FastSpeech2 with AISHELL-3
This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
We use AISHELL-3 to train a multi-speaker fastspeech2 model here.
## Dataset
...@@ -17,7 +17,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
...@@ -32,7 +32,7 @@ Run the command below to
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
...@@ -59,9 +59,9 @@ dump
├── raw
└── speech_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, the speaker, and the id of each utterance.
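To make the layout above concrete, here is a minimal sketch of how such a `metadata.jsonl` file and the `*_stats.npy` statistics could be consumed. The field names (`utt_id`, `feats`) and the assumption that the stats file stores stacked per-dimension mean and standard deviation are illustrative only and may not match the repo's exact dump format.
```python
import json
import numpy as np

# Hypothetical field names ("utt_id", "feats") and stats layout
# (row 0 = mean, row 1 = std); the real dump format may differ.
stats = np.load("dump/train/speech_stats.npy")
mean, std = stats[0], stats[1]

with open("dump/train/norm/metadata.jsonl") as f:
    for line in f:
        record = json.loads(line)          # one utterance per line
        feats = np.load(record["feats"])   # normalized speech feature
        restored = feats * std + mean      # undo the normalization
        print(record["utt_id"], restored.shape)
```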
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
...@@ -95,14 +95,14 @@ optional arguments:
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker FastSpeech2.
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it.
```bash
unzip pwg_aishell3_ckpt_0.5.zip
```
...@@ -113,98 +113,115 @@ pwg_aishell3_ckpt_0.5
├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
└── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--am` is the acoustic model type with the format {model_name}_{dataset} (see the sketch below).
2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict`, and `--speaker_dict` are arguments for the acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, and `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
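A small sketch of the `{model_name}_{dataset}` naming convention from items 1 and 3. Splitting on the last underscore is an assumption that happens to hold for every `--am`/`--voc` choice listed in the usage text above (including `mb_melgan_csmsc`); it is not necessarily how the repo parses the flag.
```python
# Split "{model_name}_{dataset}" on the last underscore (assumption that
# holds for all the choices shown in the usage text above).
for name in ["fastspeech2_aishell3", "pwgan_aishell3", "mb_melgan_csmsc"]:
    model_name, dataset = name.rsplit("_", 1)
    print(name, "->", model_name, dataset)
```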
## Pretrained Model
Pretrained FastSpeech2 model with no silence at the edges of audios: [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
...@@ -225,16 +242,20 @@ source path.sh
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=fastspeech2_aishell3 \
  --am_config=fastspeech2_nosil_aishell3_ckpt_0.4/default.yaml \
  --am_ckpt=fastspeech2_nosil_aishell3_ckpt_0.4/snapshot_iter_96400.pdz \
  --am_stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --lang=zh \
  --text=${BIN_DIR}/../sentences.txt \
  --output_dir=exp/default/test_e2e \
  --phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
  --speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
  --spk_id=0
```
...@@ -3,9 +3,9 @@
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
......
...@@ -6,14 +6,16 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
  --am=fastspeech2_aishell3 \
  --am_config=${config_path} \
  --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
  --am_stat=dump/train/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --test_metadata=dump/test/norm/metadata.jsonl \
  --output_dir=${train_output_path}/test \
  --phones_dict=dump/phone_id_map.txt \
  --speaker_dict=dump/speaker_id_map.txt
...@@ -6,14 +6,18 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=fastspeech2_aishell3 \
  --am_config=${config_path} \
  --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
  --am_stat=dump/train/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --lang=zh \
  --text=${BIN_DIR}/../sentences.txt \
  --output_dir=${train_output_path}/test_e2e \
  --phones_dict=dump/phone_id_map.txt \
  --speaker_dict=dump/speaker_id_map.txt \
  --spk_id=0
# Tacotron2 + AISHELL-3 Voice Cloning
This example contains code used to train a [Tacotron2](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in the Voice Cloning Task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use Speaker Verification to train a speaker encoder. The datasets used in this task are different from those used in Tacotron2 because transcriptions are not needed, so we use more datasets; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
2. Synthesizer: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of Tacotron2 and is concatenated with the encoder outputs (see the sketch after this list).
3. Vocoder: We use WaveFlow as the neural vocoder; refer to [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0).
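As referenced in step 2, here is a shape-level sketch (with made-up dimensions) of what concatenating an utterance-level speaker embedding with the encoder outputs means; it is illustrative only and not the repo's actual Tacotron2 code.
```python
import numpy as np

# Illustrative shapes only: 50 encoder frames of size 256, plus a
# 256-dim GE2E speaker embedding broadcast over time and appended.
encoder_out = np.random.randn(50, 256)             # (T_text, encoder_dim)
spk_emb = np.random.randn(256)                     # utterance-level embedding
tiled = np.tile(spk_emb, (encoder_out.shape[0], 1))
decoder_in = np.concatenate([encoder_out, tiled], axis=-1)
print(decoder_in.shape)                            # (50, 512)
```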
...@@ -39,13 +39,13 @@ fi
The computing time of utterance embedding can be x hours.
#### Process Wav
There is silence at the edges of AISHELL-3's wavs, and the audio amplitude is very small, so we need to remove the silence and normalize the audio. You can use a silence-removal method based on volume or energy, but the effect is not very good. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence (a rough sketch of this step is shown below).
We use Montreal Forced Aligner 1.0. The labels in aishell3 include pinyin, so the lexicon we provide to MFA is pinyin rather than Chinese characters, and the prosody marks (`$` and `%`) need to be removed. You should preprocess the dataset into the format MFA needs: the texts have the same names as the wavs and the suffix `.lab`.
We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon.
You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) in our repo.
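A rough sketch of the alignment-based trimming described above, assuming the MFA result has already been parsed into `(start, end, label)` intervals in seconds. The silence labels, the helper name, and the `soundfile` dependency are assumptions; TextGrid parsing is omitted.
```python
import soundfile as sf

# intervals: list of (start_sec, end_sec, label) parsed from the MFA
# alignment (parsing omitted); the labels treated as silence are assumed.
def trim_with_alignment(wav_path, intervals, sil_labels=("sil", "sp", "")):
    wav, sr = sf.read(wav_path)
    speech = [(s, e) for s, e, lab in intervals if lab not in sil_labels]
    if not speech:                 # nothing but silence: return as-is
        return wav, sr
    start = int(speech[0][0] * sr)
    end = int(speech[-1][1] * sr)
    return wav[start:end], sr
```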
```bash
...@@ -83,9 +83,9 @@ fi
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path}
```
Our model removes the stop token prediction in Tacotron2, because the proportion of positive and negative samples for stop token prediction is extremely unbalanced and it is very sensitive to how the audio silence is clipped. Instead, we use the last symbol from the highest point of attention to the encoder side as the termination condition.
In addition, to accelerate the convergence of the model, we add a `guided attention loss` to induce the alignment between the encoder and the decoder to show diagonal lines faster (one common formulation is sketched below).
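One common formulation of the guided attention weights (from Tachibana et al., 2017) is sketched here; it may differ in detail from the loss actually implemented in this repo.
```python
import numpy as np

# W[n, t] grows as attention moves away from the diagonal, so a loss
# of the form mean(W * A) pushes the attention matrix A towards a
# roughly diagonal (monotonic) text-to-frame alignment.
def guided_attention_weights(text_len, mel_len, g=0.2):
    n = np.arange(text_len)[:, None] / text_len
    t = np.arange(mel_len)[None, :] / mel_len
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

W = guided_attention_weights(60, 200)
print(W.shape)  # (60, 200)
```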
### Voice Cloning
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${ge2e_params_path} ${tacotron2_params_path} ${waveflow_params_path} ${vc_input} ${vc_output}
......
...@@ -7,8 +7,8 @@ vc_input=$4
vc_output=$5
python3 ${BIN_DIR}/voice_cloning.py \
  --ge2e_params_path=${ge2e_params_path} \
  --tacotron2_params_path=${tacotron2_params_path} \
  --waveflow_params_path=${waveflow_params_path} \
  --input-dir=${vc_input} \
  --output-dir=${vc_output}
\ No newline at end of file
# FastSpeech2 + AISHELL-3 Voice Cloning
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in the Voice Cloning Task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use Speaker Verification to train a speaker encoder. The datasets used in this task are different from those used in `FastSpeech2` because transcriptions are not needed, so we use more datasets; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
2. Synthesizer: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `FastSpeech2` and is concatenated with the encoder outputs.
3. Vocoder: We use [Parallel Wave GAN](http://arxiv.org/abs/1910.11480) as the neural vocoder; refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1).
...@@ -18,7 +18,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) in our repo.
## Pretrained GE2E Model
We use a pretrained GE2E model to generate a speaker embedding for each sentence.
...@@ -39,7 +39,7 @@ Run the command below to
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
...@@ -72,20 +72,20 @@ dump
```
The `embed` folder contains the generated speaker embedding for each sentence in AISHELL-3, which has the same file structure as the wav files; the format is `.npy`.
The computing time of utterance embedding can be x hours.
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, the speaker, and the id of each utterance.
The preprocessing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ge2e/inference` step here.
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
The training step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`.
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
...@@ -100,11 +100,11 @@ pwg_aishell3_ckpt_0.5
├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
└── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
The synthesizing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/../synthesize.py`.
### Voice Cloning
Assume there are some reference audios in `./ref_audio`
......
...@@ -3,9 +3,9 @@
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
......
...@@ -6,14 +6,17 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
  --am=fastspeech2_aishell3 \
  --am_config=${config_path} \
  --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
  --am_stat=dump/train/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --test_metadata=dump/test/norm/metadata.jsonl \
  --output_dir=${train_output_path}/test \
  --phones_dict=dump/phone_id_map.txt \
  --speaker_dict=dump/speaker_id_map.txt \
  --voice-cloning=True
...@@ -9,14 +9,14 @@ ref_audio_dir=$5
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/voice_cloning.py \
  --fastspeech2-config=${config_path} \
  --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
  --fastspeech2-stat=dump/train/speech_stats.npy \
  --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \
  --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --ge2e_params_path=${ge2e_params_path} \
  --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \
  --input-dir=${ref_audio_dir} \
  --output-dir=${train_output_path}/vc_syn \
  --phones-dict=dump/phone_id_map.txt
# Parallel WaveGAN with AISHELL-3
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [AISHELL-3](http://www.aishelltech.com/aishell_3).
AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems.
## Dataset
### Download and Extract
Download AISHELL-3.
...@@ -15,7 +15,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
...@@ -53,9 +53,9 @@ dump
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the spectrogram path of each utterance.
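For reference, a rough sketch of extracting the kind of log-magnitude mel spectrogram described above, using `librosa` with the settings from `conf/default.yaml`. The repo's own extractor may differ (e.g. in fmin/fmax or the log base), so this is illustrative only.
```python
import librosa
import numpy as np

# Settings mirror conf/default.yaml: fs=24000, n_fft=2048,
# n_shift=300 (hop), win_length=1200, n_mels=80.
wav, sr = librosa.load("sample.wav", sr=24000)
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=2048, hop_length=300,
    win_length=1200, window="hann", n_mels=80)
log_mel = np.log10(np.maximum(mel, 1e-10))  # clip to avoid log(0)
print(log_mel.shape)                        # (80, num_frames)
```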
### Model Training
```bash
...@@ -101,7 +101,7 @@ benchmark:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing
...@@ -110,15 +110,19 @@ benchmark:
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
                     [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
                     [--output-dir OUTPUT_DIR] [--ngpu NGPU]
                     [--verbose VERBOSE]

Synthesize with GANVocoder.

optional arguments:
  -h, --help            show this help message and exit
  --generator-type GENERATOR_TYPE
                        type of GANVocoder, should in {pwgan, mb_melgan,
                        style_melgan, } now
  --config CONFIG       GANVocoder config file.
  --checkpoint CHECKPOINT
                        snapshot to load.
  --test-metadata TEST_METADATA
......
...@@ -7,9 +7,9 @@
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
...@@ -49,9 +49,9 @@ discriminator_params:
    bias: true # Whether to use bias parameter in conv.
    use_weight_norm: true # Whether to use weight norm.
    # If set to true, it will be applied to all of the conv layers.
    nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
    nonlinear_activation_params: # Nonlinear function parameters
        negative_slope: 0.2 # Alpha in leakyrelu.
###########################################################
# STFT LOSS SETTING #
......
...@@ -7,8 +7,8 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
  --config=${config_path} \
  --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
  --test-metadata=dump/test/norm/metadata.jsonl \
  --output-dir=${train_output_path}/test \
  --generator-type=pwgan
...@@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH.
You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
...@@ -18,8 +18,8 @@ Run the command below to
3. train the model.
4. synthesize wavs.
   - synthesize waveform from `metadata.jsonl`.
   - synthesize waveform from a text file.
5. inference using the static model.
```bash
./run.sh
```
...@@ -47,9 +47,9 @@ dump
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, tones, durations, the path of the spectrogram, and the id of each utterance.
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
...@@ -64,7 +64,7 @@ usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--use-relative-path USE_RELATIVE_PATH]
                [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]

Train a Speedyspeech model with a single speaker dataset.

optional arguments:
  -h, --help            show this help message and exit
...@@ -87,7 +87,7 @@ optional arguments:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--tones-dict` is the path of the tone vocabulary file.
...@@ -105,107 +105,118 @@ pwg_baker_ckpt_0.4
├── pwg_snapshot_iter_400000.pdz # model parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with acoustic model & vocoder

optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text ```text
usage: synthesize_e2e.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG] usage: synthesize_e2e.py [-h]
[--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--speedyspeech-stat SPEEDYSPEECH_STAT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--pwg-config PWG_CONFIG] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--tones_dict TONES_DICT]
[--pwg-stat PWG_STAT] [--text TEXT] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--phones-dict PHONES_DICT] [--tones-dict TONES_DICT] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--output-dir OUTPUT_DIR] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--inference-dir INFERENCE_DIR] [--verbose VERBOSE] [--voc_stat VOC_STAT] [--lang LANG]
[--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--text TEXT] [--output_dir OUTPUT_DIR]
Synthesize with speedyspeech & parallel wavegan.
Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--speedyspeech-config SPEEDYSPEECH_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
config file for speedyspeech. Choose acoustic model type of tts task.
--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT --am_config AM_CONFIG
speedyspeech checkpoint to load. Config of acoustic model. Use deault config when it is
--speedyspeech-stat SPEEDYSPEECH_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training speedyspeech. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
config file for parallelwavegan. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan checkpoint to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training speedyspeech.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--tones-dict TONES_DICT --tones_dict TONES_DICT
tone vocabulary file. tone vocabulary file.
--output-dir OUTPUT_DIR --speaker_dict SPEAKER_DICT
output dir speaker id map file.
--inference-dir INFERENCE_DIR --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG Choose model language. zh or en
--inference_dir INFERENCE_DIR
dir to save inference models dir to save inference models
--verbose VERBOSE verbose
--ngpu NGPU if ngpu == 0, use cpu. --ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output_dir OUTPUT_DIR
output dir.
``` ```
1. `--speedyspeech-config`, `--speedyspeech-checkpoint`, `--speedyspeech-stat` are arguments for speedyspeech, which correspond to the 3 files in the speedyspeech pretrained model. 1. `--am` is the acoustic model type, with the format `{model_name}_{dataset}`.
2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict` and `--tones_dict` are arguments for the acoustic model, which correspond to the 5 files in the speedyspeech pretrained model (see the listing right after this list).
3. `--text` is the text file, which contains sentences to synthesize. 3. `--voc` is the vocoder type, with the format `{model_name}_{dataset}`.
4. `--output-dir` is the directory to save synthesized audio files. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--inference-dir` is the directory to save exported model, which can be used with paddle infernece. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--phones-dict` is the path of the phone vocabulary file. 7. `--text` is the text file, which contains sentences to synthesize.
8. `--tones-dict` is the path of the tone vocabulary file. 8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
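For note 2, the five acoustic-model files are the contents of the SpeedySpeech pretrained archive used in the Pretrained Model section below; a quick way to check them after unzipping (directory and file names are taken from that example, so treat this listing as assumed):
```bash
ls speedyspeech_nosil_baker_ckpt_0.5
# default.yaml  feats_stats.npy  phone_id_map.txt  snapshot_iter_11400.pdz  tone_id_map.txt
```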
### Inferencing ### Inferencing
After Synthesize, we will get static models of speedyspeech and pwgan in `${train_output_path}/inference`. After synthesizing, we will get static models of speedyspeech and pwgan in `${train_output_path}/inference`.
`./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for speedyspeech + pwgan synthesize. `./local/inference.sh` calls `${BIN_DIR}/../inference.py`, which provides a paddle static model inference example for speedyspeech + pwgan synthesis.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
...@@ -214,7 +225,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} ...@@ -214,7 +225,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
## Pretrained Model ## Pretrained Model
Pretrained SpeedySpeech model with no silence at the edge of audio: [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip). Pretrained SpeedySpeech model with no silence at the edge of audio: [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip).
Static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip). The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------: :-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
...@@ -235,16 +246,19 @@ source path.sh ...@@ -235,16 +246,19 @@ source path.sh
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--speedyspeech-config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \ --am=speedyspeech_csmsc \
--speedyspeech-checkpoint=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \ --am_config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
--speedyspeech-stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \ --am_ckpt=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ --am_stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_csmsc \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \ --text=${BIN_DIR}/../sentences.txt \
--output-dir=exp/default/test_e2e \ --output_dir=exp/default/test_e2e \
--inference-dir=exp/default/inference \ --inference_dir=exp/default/inference \
--phones-dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \ --phones_dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
--tones-dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt --tones_dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
``` ```
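If the command above completes, the synthesized waveforms should be in the `--output_dir` it was given; a quick sanity check (assuming the wavs are named by `utt_id`, one per line of `sentences.txt`):
```bash
ls exp/default/test_e2e/*.wav
```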
...@@ -2,9 +2,9 @@ ...@@ -2,9 +2,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 24000 # Sampling rate. fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size. n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
......
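The millisecond annotations added in this hunk follow directly from the sampling rate; a quick check of the assumed values (`fs=24000`, `n_shift=300`, `win_length=1200`):
```bash
python3 -c "fs = 24000; print(300 / fs * 1000, 'ms hop,', 1200 / fs * 1000, 'ms window')"
# 12.5 ms hop, 50.0 ms window
```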
...@@ -2,9 +2,55 @@ ...@@ -2,9 +2,55 @@
train_output_path=$1 train_output_path=$1
python3 ${BIN_DIR}/inference.py \ stage=0
--inference-dir=${train_output_path}/inference \ stop_stage=0
--text=${BIN_DIR}/../sentences.txt \
--output-dir=${train_output_path}/pd_infer_out \ # pwgan
--phones-dict=dump/phone_id_map.txt \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--tones-dict=dump/tone_id_map.txt python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=speedyspeech_csmsc \
--voc=pwgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=speedyspeech_csmsc \
--voc=mb_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# style melgan
# style melgan's dygraph-to-static graph conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=speedyspeech_csmsc \
--voc=style_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=speedyspeech_csmsc \
--voc=hifigan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
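The vocoder used for static inference is picked by the `stage`/`stop_stage` variables hard-coded at the top of this script; one way to try only the newly added HiFiGAN stage is a small sketch like the following (the `exp/default` output path is an assumption, adjust to your own `train_output_path`):
```bash
# run only stage 3 (hifigan_csmsc) of local/inference.sh, then invoke it as usual
sed -i 's/^stage=0/stage=3/; s/^stop_stage=0/stop_stage=3/' local/inference.sh
CUDA_VISIBLE_DEVICES=0 ./local/inference.sh exp/default
```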
...@@ -5,15 +5,16 @@ ckpt_name=$3 ...@@ -5,15 +5,16 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--speedyspeech-config=${config_path} \ --am=speedyspeech_csmsc \
--speedyspeech-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --am_config=${config_path} \
--speedyspeech-stat=dump/train/feats_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ --am_stat=dump/train/feats_stats.npy \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_csmsc \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--test-metadata=dump/test/norm/metadata.jsonl \ --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--output-dir=${train_output_path}/test \ --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--inference-dir=${train_output_path}/inference \ --test_metadata=dump/test/norm/metadata.jsonl \
--phones-dict=dump/phone_id_map.txt \ --output_dir=${train_output_path}/test \
--tones-dict=dump/tone_id_map.txt --phones_dict=dump/phone_id_map.txt \
\ No newline at end of file --tones_dict=dump/tone_id_map.txt
\ No newline at end of file
...@@ -4,17 +4,91 @@ config_path=$1 ...@@ -4,17 +4,91 @@ config_path=$1
train_output_path=$2 train_output_path=$2
ckpt_name=$3 ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ stage=0
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ stop_stage=0
python3 ${BIN_DIR}/synthesize_e2e.py \
--speedyspeech-config=${config_path} \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--speedyspeech-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ FLAGS_allocator_strategy=naive_best_fit \
--speedyspeech-stat=dump/train/feats_stats.npy \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --am=speedyspeech_csmsc \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ --am_config=${config_path} \
--text=${BIN_DIR}/../sentences.txt \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--output-dir=${train_output_path}/test_e2e \ --am_stat=dump/train/feats_stats.npy \
--inference-dir=${train_output_path}/inference \ --voc=pwgan_csmsc \
--phones-dict=dump/phone_id_map.txt \ --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--tones-dict=dump/tone_id_map.txt --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/feats_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_baker_finetune_ckpt_0.5/finetune.yaml \
--voc_ckpt=mb_melgan_baker_finetune_ckpt_0.5/snapshot_iter_2000000.pdz\
--voc_stat=mb_melgan_baker_finetune_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# the pretrained models haven't been released yet
# style melgan
# style melgan's dygraph-to-static graph conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/feats_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
# --inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/feats_stats.npy \
--voc=hifigan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
...@@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ...@@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started ## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`. Assume the path to the dataset is `~/datasets/BZNSYP`.
...@@ -18,12 +18,12 @@ Run the command below to ...@@ -18,12 +18,12 @@ Run the command below to
3. train the model. 3. train the model.
4. synthesize wavs. 4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`. - synthesize waveform from `metadata.jsonl`.
- synthesize waveform from text file. - synthesize waveform from a text file.
5. inference using static model. 5. inference using the static model.
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -50,9 +50,9 @@ dump ...@@ -50,9 +50,9 @@ dump
├── raw ├── raw
└── speech_stats.npy └── speech_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and the id of each utterance.
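Since `metadata.jsonl` is a JSON-lines file, a single record can be inspected directly from the shell; for example (the path follows the `dump` layout above):
```bash
# pretty-print the first record of the normalized training metadata
head -n 1 dump/train/norm/metadata.jsonl | python3 -m json.tool
```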
### Model Training ### Model Training
```bash ```bash
...@@ -86,7 +86,7 @@ optional arguments: ...@@ -86,7 +86,7 @@ optional arguments:
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file. 5. `--phones-dict` is the path of the phone vocabulary file.
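Put together, the training options above amount to a call roughly like the following (a sketch of what `./local/train.sh` wraps; the `${BIN_DIR}/train.py` entry point and the `--dev-metadata` flag are assumptions not shown in the excerpt above):
```bash
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --phones-dict=dump/phone_id_map.txt \
    --ngpu=1
```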
...@@ -103,100 +103,118 @@ pwg_baker_ckpt_0.4 ...@@ -103,100 +103,118 @@ pwg_baker_ckpt_0.4
├── pwg_snapshot_iter_400000.pdz # model parameters of parallel wavegan ├── pwg_snapshot_iter_400000.pdz # model parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
``` ```
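The layout above assumes the Parallel WaveGAN checkpoint has already been downloaded and unpacked in the example directory; the archive is the released `pwg_baker_ckpt_0.4.zip` (same link as in the pwgan example's Pretrained Models section later in this diff):
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip
unzip pwg_baker_ckpt_0.4.zip   # produces the pwg_baker_ckpt_0.4/ directory shown above
```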
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] usage: synthesize.py [-h]
[--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--fastspeech2-stat FASTSPEECH2_STAT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--pwg-config PWG_CONFIG] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--ngpu NGPU] [--verbose VERBOSE] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
Synthesize with fastspeech2 & parallel wavegan. Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--fastspeech2-config FASTSPEECH2_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
fastspeech2 config file. Choose acoustic model type of tts task.
--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT --am_config AM_CONFIG
fastspeech2 checkpoint to load. Config of acoustic model. Use deault config when it is
--fastspeech2-stat FASTSPEECH2_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training fastspeech2. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
parallel wavegan config file. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training parallel wavegan.
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--speaker-dict SPEAKER_DICT --tones_dict TONES_DICT
speaker id map file for multiple speaker model. tone vocabulary file.
--test-metadata TEST_METADATA --speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata. test metadata.
--output-dir OUTPUT_DIR --output_dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose.
``` ```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e.py`, which can synthesize waveform from text file. `./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize_e2e.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] usage: synthesize_e2e.py [-h]
[--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--fastspeech2-stat FASTSPEECH2_STAT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--pwg-config PWG_CONFIG] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--tones_dict TONES_DICT]
[--pwg-stat PWG_STAT] [--phones-dict PHONES_DICT] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--text TEXT] [--output-dir OUTPUT_DIR] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--inference-dir INFERENCE_DIR] [--ngpu NGPU] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--verbose VERBOSE] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
Synthesize with fastspeech2 & parallel wavegan. [--text TEXT] [--output_dir OUTPUT_DIR]
Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--fastspeech2-config FASTSPEECH2_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
fastspeech2 config file. Choose acoustic model type of tts task.
--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT --am_config AM_CONFIG
fastspeech2 checkpoint to load. Config of acoustic model. Use deault config when it is
--fastspeech2-stat FASTSPEECH2_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training fastspeech2. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
parallel wavegan config file. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training parallel wavegan.
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line. --tones_dict TONES_DICT
--output-dir OUTPUT_DIR tone vocabulary file.
output dir. --speaker_dict SPEAKER_DICT
--inference-dir INFERENCE_DIR speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG Choose model language. zh or en
--inference_dir INFERENCE_DIR
dir to save inference models dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu. --ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose. --text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output_dir OUTPUT_DIR
output dir.
``` ```
1. `--am` is the acoustic model type, with the format `{model_name}_{dataset}`.
1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat` and `--phones-dict` are arguments for fastspeech2, which correspond to the 4 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. 3. `--voc` is the vocoder type, with the format `{model_name}_{dataset}`.
3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
4. `--text` is the text file, which contains sentences to synthesize. 5. `--lang` is the model language, which can be `zh` or `en`.
5. `--output-dir` is the directory to save synthesized audio files. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
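The `--text` file from note 7 holds one `utt_id sentence` pair per line; a tiny hypothetical file in that format can be created like this (the two sentences are made up and not taken from `sentences.txt`):
```bash
printf '%s\n' \
  "001 你好，欢迎使用语音合成。" \
  "002 今天天气真不错。" > my_sentences.txt
# then pass --text=my_sentences.txt instead of ${BIN_DIR}/../sentences.txt
```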
### Inferencing ### Inferencing
After Synthesize, we will get static models of fastspeech2 and pwgan in `${train_output_path}/inference`. After synthesizing, we will get static models of fastspeech2 and pwgan in `${train_output_path}/inference`.
`./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for fastspeech2 + pwgan synthesize. `./local/inference.sh` calls `${BIN_DIR}/../inference.py`, which provides a paddle static model inference example for fastspeech2 + pwgan synthesis.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
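# (optional) after synthesize_e2e has been run with --inference_dir, the exported static
# models should show up here; exact file names are assumed (e.g. fastspeech2_csmsc.pdmodel)
ls ${train_output_path}/inference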
...@@ -207,7 +225,7 @@ Pretrained FastSpeech2 model with no silence in the edge of audios: ...@@ -207,7 +225,7 @@ Pretrained FastSpeech2 model with no silence in the edge of audios:
- [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) - [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)
- [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip) - [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)
Static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip). The static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip).
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
...@@ -228,15 +246,18 @@ source path.sh ...@@ -228,15 +246,18 @@ source path.sh
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \ --am=fastspeech2_csmsc \
--fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \ --am_config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
--fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \ --am_ckpt=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ --am_stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_csmsc \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \ --text=${BIN_DIR}/../sentences.txt \
--output-dir=exp/default/test_e2e \ --output_dir=exp/default/test_e2e \
--inference-dir=exp/default/inference \ --inference_dir=exp/default/inference \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt --phones_dict=dump/phone_id_map.txt
``` ```
...@@ -3,9 +3,9 @@ ...@@ -3,9 +3,9 @@
########################################################### ###########################################################
fs: 24000 # sr fs: 24000 # sr
n_fft: 2048 # FFT size. n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
......
...@@ -3,9 +3,9 @@ ...@@ -3,9 +3,9 @@
########################################################### ###########################################################
fs: 24000 # sr fs: 24000 # sr
n_fft: 2048 # FFT size. n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
......
...@@ -2,8 +2,50 @@ ...@@ -2,8 +2,50 @@
train_output_path=$1 train_output_path=$1
python3 ${BIN_DIR}/inference.py \ stage=0
--inference-dir=${train_output_path}/inference \ stop_stage=0
--text=${BIN_DIR}/../sentences.txt \
--output-dir=${train_output_path}/pd_infer_out \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--phones-dict=dump/phone_id_map.txt python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
--voc=pwgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
--voc=mb_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# style melgan
# style melgan's dygraph-to-static graph conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
--voc=style_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
--voc=hifigan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
\ No newline at end of file
...@@ -6,13 +6,15 @@ ckpt_name=$3 ...@@ -6,13 +6,15 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--fastspeech2-config=${config_path} \ --am=fastspeech2_csmsc \
--fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --am_config=${config_path} \
--fastspeech2-stat=dump/train/speech_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ --am_stat=dump/train/speech_stats.npy \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_csmsc \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--test-metadata=dump/test/norm/metadata.jsonl \ --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--output-dir=${train_output_path}/test \ --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--phones-dict=dump/phone_id_map.txt --test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
...@@ -4,16 +4,88 @@ config_path=$1 ...@@ -4,16 +4,88 @@ config_path=$1
train_output_path=$2 train_output_path=$2
ckpt_name=$3 ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ stage=0
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ stop_stage=0
python3 ${BIN_DIR}/synthesize_e2e.py \
--fastspeech2-config=${config_path} \ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ FLAGS_allocator_strategy=naive_best_fit \
--fastspeech2-stat=dump/train/speech_stats.npy \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
--pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ --am=fastspeech2_csmsc \
--pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ --am_config=${config_path} \
--text=${BIN_DIR}/../sentences.txt \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--output-dir=${train_output_path}/test_e2e \ --am_stat=dump/train/speech_stats.npy \
--inference-dir=${train_output_path}/inference \ --voc=pwgan_csmsc \
--phones-dict=dump/phone_id_map.txt --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_baker_finetune_ckpt_0.5/finetune.yaml \
--voc_ckpt=mb_melgan_baker_finetune_ckpt_0.5/snapshot_iter_2000000.pdz\
--voc_stat=mb_melgan_baker_finetune_ckpt_0.5/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
fi
# the pretrained models haven't been released yet
# style melgan
# style melgan's dygraph-to-static graph conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_test/default.yaml \
--voc_ckpt=style_melgan_test/snapshot_iter_935000.pdz \
--voc_stat=style_melgan_test/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt
# --inference_dir=${train_output_path}/inference
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "in hifigan syn_e2e"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_test/default.yaml \
--voc_ckpt=hifigan_test/snapshot_iter_1600000.pdz \
--voc_stat=hifigan_test/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
fi
...@@ -2,11 +2,11 @@ ...@@ -2,11 +2,11 @@
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html). This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset ## Dataset
### Download and Extract ### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started ## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`. Assume the path to the dataset is `~/datasets/BZNSYP`.
...@@ -20,7 +20,7 @@ Run the command below to ...@@ -20,7 +20,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -43,9 +43,9 @@ dump ...@@ -43,9 +43,9 @@ dump
├── raw ├── raw
└── feats_stats.npy └── feats_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training ### Model Training
```bash ```bash
...@@ -91,7 +91,7 @@ benchmark: ...@@ -91,7 +91,7 @@ benchmark:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing ### Synthesizing
...@@ -100,15 +100,19 @@ benchmark: ...@@ -100,15 +100,19 @@ benchmark:
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--ngpu NGPU] [--verbose VERBOSE] [--output-dir OUTPUT_DIR] [--ngpu NGPU]
[--verbose VERBOSE]
Synthesize with parallel wavegan. Synthesize with GANVocoder.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG parallel wavegan config file. --generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT --checkpoint CHECKPOINT
snapshot to load. snapshot to load.
--test-metadata TEST_METADATA --test-metadata TEST_METADATA
...@@ -126,9 +130,9 @@ optional arguments: ...@@ -126,9 +130,9 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models ## Pretrained Models
Pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip). The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip).
Static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip). The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss | eval/spectral_convergence_loss Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss | eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------: :-------------:| :------------:| :-----: | :-----: | :--------:
......
...@@ -7,9 +7,9 @@ ...@@ -7,9 +7,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 24000 # Sampling rate. fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size. (in samples) n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. (in samples) n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. (in samples) win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
...@@ -56,9 +56,9 @@ discriminator_params: ...@@ -56,9 +56,9 @@ discriminator_params:
bias: true # Whether to use bias parameter in conv. bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm. use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers. # If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU. negative_slope: 0.2 # Alpha in leakyrelu.
########################################################### ###########################################################
# STFT LOSS SETTING # # STFT LOSS SETTING #
......
...@@ -7,8 +7,8 @@ ckpt_name=$3 ...@@ -7,8 +7,8 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \ --config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \ --test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \ --output-dir=${train_output_path}/test \
--generator-type=pwgan --generator-type=pwgan
...@@ -2,11 +2,11 @@ ...@@ -2,11 +2,11 @@
This example contains code used to train a [Multi Band MelGAN](https://arxiv.org/abs/2005.05106) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html). This example contains code used to train a [Multi Band MelGAN](https://arxiv.org/abs/2005.05106) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset ## Dataset
### Download and Extract ### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edge of audio.
You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo.
## Get Started ## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`. Assume the path to the dataset is `~/datasets/BZNSYP`.
...@@ -20,7 +20,7 @@ Run the command below to ...@@ -20,7 +20,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -43,9 +43,9 @@ dump ...@@ -43,9 +43,9 @@ dump
├── raw ├── raw
└── feats_stats.npy └── feats_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training ### Model Training
```bash ```bash
...@@ -76,7 +76,7 @@ optional arguments: ...@@ -76,7 +76,7 @@ optional arguments:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing ### Synthesizing
...@@ -85,15 +85,19 @@ optional arguments: ...@@ -85,15 +85,19 @@ optional arguments:
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--ngpu NGPU] [--verbose VERBOSE] [--output-dir OUTPUT_DIR] [--ngpu NGPU]
[--verbose VERBOSE]
Synthesize with multi band melgan. Synthesize with GANVocoder.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG multi band melgan config file. --generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT --checkpoint CHECKPOINT
snapshot to load. snapshot to load.
--test-metadata TEST_METADATA --test-metadata TEST_METADATA
...@@ -111,22 +115,22 @@ optional arguments: ...@@ -111,22 +115,22 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Fine-tuning ## Fine-tuning
Since there are no `noise` in the input of Multi Band MelGAN, the audio quality is not so good (see [espnet issue](https://github.com/espnet/espnet/issues/3536#issuecomment-916035415)), we refer to the method proposed in [HiFiGAN](https://arxiv.org/abs/2010.05646), finetune Multi Band MelGAN with the predicted mel-spectrogram from `FastSpeech2`. Since there is no `noise` in the input of Multi-Band MelGAN, the audio quality is not so good (see [espnet issue](https://github.com/espnet/espnet/issues/3536#issuecomment-916035415)), we refer to the method proposed in [HiFiGAN](https://arxiv.org/abs/2010.05646), finetune Multi-Band MelGAN with the predicted mel-spectrogram from `FastSpeech2`.
The length of mel-spectrograms should align with the length of wavs, so we should generate mels using ground truth alignment. The length of mel-spectrograms should align with the length of wavs, so we should generate mels using ground truth alignment.
But since we are fine-tuning, we should use the statistics computed during training step. But since we are fine-tuning, we should use the statistics computed during the training step.
You should first download pretrained `FastSpeech2` model from [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) and `unzip` it. You should first download pretrained `FastSpeech2` model from [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) and `unzip` it.
Assume the path to the dump-dir of training step is `dump`. Assume the path to the dump-dir of training step is `dump`.
Assume the path to the duration result of CSMSC is `durations.txt` (generated during training step's preprocessing). Assume the path to the duration result of CSMSC is `durations.txt` (generated during the training step's preprocessing).
Assume the path to the pretrained `FastSpeech2` model is `fastspeech2_nosil_baker_ckpt_0.4`. Assume the path to the pretrained `FastSpeech2` model is `fastspeech2_nosil_baker_ckpt_0.4`.
\ \
The `finetune.sh` can The `finetune.sh` can
1. **source path**. 1. **source path**.
2. generate ground truth alignment mels. 2. generate ground truth alignment mels.
3. link `*_wave.npy` from `dump` to `dump_finetune` (because we only use new mels, the wavs are the ones used during train step) . 3. link `*_wave.npy` from `dump` to `dump_finetune` (because we only use new mels, the wavs are the ones used during the training step).
4. copy features' stats from `dump` to `dump_finetune`. 4. copy features' stats from `dump` to `dump_finetune`.
5. normalize the ground truth alignment mels. 5. normalize the ground truth alignment mels.
6. finetune the model. 6. finetune the model.
...@@ -137,9 +141,9 @@ exp/finetune/checkpoints ...@@ -137,9 +141,9 @@ exp/finetune/checkpoints
├── records.jsonl ├── records.jsonl
└── snapshot_iter_1000000.pdz └── snapshot_iter_1000000.pdz
``` ```
The content of `records.jsonl` should be as follows (change `"path"` to your own ckpt path): The content of `records.jsonl` should be as follows (change `"path"` to your ckpt path):
``` ```
{"time": "2021-11-21 15:11:20.337311", "path": "~/PaddleSpeech/examples/csmsc/voc3/exp/finetune/checkpoints/snapshot_iter_1000000.pdz", "iteration": 1000000} {"time": "2021-11-21 15:11:20.337311", "path": "~/PaddleSpeech/examples/csmsc/voc3/exp/finetune/checkpoints/snapshot_iter_1000000.pdz", "iteration": 1000000}
``` ```
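If you prefer not to hand-edit the file, a record like the one above can also be written with `jsonlines`; this is just a convenience sketch, and the `path`, `iteration`, and output location are examples you should replace with your own checkpoint.
```python
import datetime
import os

import jsonlines

ckpt_dir = "exp/finetune/checkpoints"
os.makedirs(ckpt_dir, exist_ok=True)
# one json object per line, same fields as the example record above
record = {
    "time": str(datetime.datetime.now()),
    "path": os.path.join(ckpt_dir, "snapshot_iter_1000000.pdz"),
    "iteration": 1000000,
}
with jsonlines.open(os.path.join(ckpt_dir, "records.jsonl"), mode="w") as writer:
    writer.write(record)
```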
Run the command below Run the command below
```bash ```bash
...@@ -151,11 +155,11 @@ TODO: ...@@ -151,11 +155,11 @@ TODO:
The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set). The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set).
## Pretrained Models ## Pretrained Models
Pretrained model can be downloaded here [mb_melgan_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_ckpt_0.5.zip). The pretrained model can be downloaded here [mb_melgan_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_ckpt_0.5.zip).
Finetuned model can ben downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip). The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip).
Static model can be downloaded here [mb_melgan_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_static_0.5.zip) The static model can be downloaded here [mb_melgan_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_static_0.5.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------: :-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
......
...@@ -12,9 +12,9 @@ ...@@ -12,9 +12,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 24000 # Sampling rate. fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size. (in samples) n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. (in samples) n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. (in samples) win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
...@@ -54,7 +54,7 @@ discriminator_params: ...@@ -54,7 +54,7 @@ discriminator_params:
channels: 16 # Number of channels of the initial conv layer. channels: 16 # Number of channels of the initial conv layer.
max_downsample_channels: 512 # Maximum number of channels of downsampling layers. max_downsample_channels: 512 # Maximum number of channels of downsampling layers.
downsample_scales: [4, 4, 4] # List of downsampling scales. downsample_scales: [4, 4, 4] # List of downsampling scales.
nonlinear_activation: "LeakyReLU" # Nonlinear activation function. nonlinear_activation: "leakyrelu" # Nonlinear activation function.
nonlinear_activation_params: # Parameters of nonlinear activation function. nonlinear_activation_params: # Parameters of nonlinear activation function.
negative_slope: 0.2 negative_slope: 0.2
use_weight_norm: True # Whether to use weight norm. use_weight_norm: True # Whether to use weight norm.
......
...@@ -12,9 +12,9 @@ ...@@ -12,9 +12,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 24000 # Sampling rate. fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size. (in samples) n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. (in samples) n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. (in samples) win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
...@@ -54,7 +54,7 @@ discriminator_params: ...@@ -54,7 +54,7 @@ discriminator_params:
channels: 16 # Number of channels of the initial conv layer. channels: 16 # Number of channels of the initial conv layer.
max_downsample_channels: 512 # Maximum number of channels of downsampling layers. max_downsample_channels: 512 # Maximum number of channels of downsampling layers.
downsample_scales: [4, 4, 4] # List of downsampling scales. downsample_scales: [4, 4, 4] # List of downsampling scales.
nonlinear_activation: "LeakyReLU" # Nonlinear activation function. nonlinear_activation: "leakyrelu" # Nonlinear activation function.
nonlinear_activation_params: # Parameters of nonlinear activation function. nonlinear_activation_params: # Parameters of nonlinear activation function.
negative_slope: 0.2 negative_slope: 0.2
use_weight_norm: True # Whether to use weight norm. use_weight_norm: True # Whether to use weight norm.
......
...@@ -7,8 +7,8 @@ ckpt_name=$3 ...@@ -7,8 +7,8 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \ --config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \ --test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \ --output-dir=${train_output_path}/test \
--generator-type=mb_melgan --generator-type=mb_melgan
...@@ -2,11 +2,11 @@ ...@@ -2,11 +2,11 @@
This example contains code used to train a [Style MelGAN](https://arxiv.org/abs/2011.01557) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). This example contains code used to train a [Style MelGAN](https://arxiv.org/abs/2011.01557) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html).
## Dataset ## Dataset
### Download and Extract ### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio.
You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo.
## Get Started ## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`. Assume the path to the dataset is `~/datasets/BZNSYP`.
...@@ -20,7 +20,7 @@ Run the command below to ...@@ -20,7 +20,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -43,9 +43,9 @@ dump ...@@ -43,9 +43,9 @@ dump
├── raw ├── raw
└── feats_stats.npy └── feats_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training ### Model Training
```bash ```bash
...@@ -76,7 +76,7 @@ optional arguments: ...@@ -76,7 +76,7 @@ optional arguments:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing ### Synthesizing
...@@ -85,15 +85,19 @@ optional arguments: ...@@ -85,15 +85,19 @@ optional arguments:
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--ngpu NGPU] [--verbose VERBOSE] [--output-dir OUTPUT_DIR] [--ngpu NGPU]
[--verbose VERBOSE]
Synthesize with multi band melgan. Synthesize with GANVocoder.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG multi band melgan config file. --generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT --checkpoint CHECKPOINT
snapshot to load. snapshot to load.
--test-metadata TEST_METADATA --test-metadata TEST_METADATA
...@@ -104,7 +108,7 @@ optional arguments: ...@@ -104,7 +108,7 @@ optional arguments:
--verbose VERBOSE verbose. --verbose VERBOSE verbose.
``` ```
1. `--config` multi band melgan config file. You should use the same config with which the model is trained. 1. `--config` style melgan config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory. 2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory. 3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files. 4. `--output-dir` is the directory to save the synthesized audio files.
......
...@@ -9,9 +9,9 @@ ...@@ -9,9 +9,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 24000 # Sampling rate. fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size. (in samples) n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. (in samples) n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. (in samples) win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
......
...@@ -7,8 +7,8 @@ ckpt_name=$3 ...@@ -7,8 +7,8 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \ --config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \ --test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \ --output-dir=${train_output_path}/test \
--generator-type=style_melgan --generator-type=style_melgan
# HiFiGAN with CSMSC
This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── test
│ ├── norm
│ └── raw
└── train
├── norm
├── raw
└── feats_stats.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
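For a quick sanity check of the processed data, you can inspect `metadata.jsonl` with `jsonlines` and load one of the referenced feature files with `numpy`. This is only a minimal sketch: it assumes the record fields written by the preprocessing scripts in this example (`utt_id`, `num_frames`, `feats`, `wave`), so adjust the field names if your metadata differs.
```python
import jsonlines
import numpy as np

# read the first record of the normalized training metadata
with jsonlines.open("dump/train/norm/metadata.jsonl") as reader:
    record = next(iter(reader))

print(record["utt_id"], record.get("num_frames"))
# load the normalized log-mel feature referenced by this record
feats = np.load(record["feats"])
print("mel shape:", feats.shape)  # expected (num_frames, n_mels), e.g. (*, 80)
```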
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--verbose VERBOSE]
Train a HiFiGAN model.
optional arguments:
-h, --help show this help message and exit
--config CONFIG config file to overwrite default config.
--train-metadata TRAIN_METADATA
training data.
--dev-metadata DEV_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
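For example, with the defaults used in `run.sh` (config `conf/default.yaml`, output dir `exp/default`), a single-GPU run of `./local/train.sh` boils down to roughly the following command; `${BIN_DIR}` is exported by `path.sh`.
```bash
CUDA_VISIBLE_DEVICES=0 \
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1
```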
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--output-dir OUTPUT_DIR] [--ngpu NGPU]
[--verbose VERBOSE]
Synthesize with GANVocoder.
optional arguments:
-h, --help show this help message and exit
--generator-type GENERATOR_TYPE
type of GANVocoder, should in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT
snapshot to load.
--test-metadata TEST_METADATA
dev data.
--output-dir OUTPUT_DIR
output dir.
--ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose.
```
1. `--config` is the HiFiGAN config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
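For reference, `./local/synthesize.sh` essentially runs the command below with the HiFiGAN generator type; the checkpoint name here is only an example, so pick one that actually exists in your `checkpoints/` directory.
```bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --config=conf/default.yaml \
    --checkpoint=exp/default/checkpoints/snapshot_iter_2500000.pdz \
    --test-metadata=dump/test/norm/metadata.jsonl \
    --output-dir=exp/default/test \
    --generator-type=hifigan
```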
## Fine-tuning
# This is the configuration file for CSMSC dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256-shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
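# NOTE: the product of upsample_scales is 5 * 5 * 4 * 3 = 300, which equals
# n_shift above, so the generator expands each mel frame into exactly one hop
# of waveform samples; upsample_kernel_sizes below are simply 2x these scales.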
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: true # Whether to use additional conv layer in residual blocks.
bias: true # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: true # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: true
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: true # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: true # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: true # Whether to apply weight normalization.
use_spectral_norm: false # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: false # Whether to use multi-resolution STFT loss.
use_mel_loss: true # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
average_by_discriminators: false # Whether to average loss by #discriminators.
average_by_layers: false # Whether to average loss by #layers in each discriminator.
include_final_outputs: false # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
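# NOTE: 8400 samples / 300 (n_shift) = 28 mel frames per clip,
# i.e. each training clip covers 8400 / 24000 = 0.35 s of audio.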
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 5000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
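If you tweak `conf/default.yaml` (or `conf/finetune.yaml` below), it is easy to break the implicit constraints between the feature settings and the generator. The following is a small, hedged sanity check rather than part of the repo; it only assumes the key names shown in the config above and the usual GAN-vocoder requirement that the total upsampling factor equals the hop size.
```python
import math

import yaml  # pip install pyyaml

with open("conf/default.yaml") as f:
    cfg = yaml.safe_load(f)

# total generator upsampling must equal the hop size (n_shift)
upsample_total = math.prod(cfg["generator_params"]["upsample_scales"])
assert upsample_total == cfg["n_shift"], "upsample_scales product != n_shift"

# each training clip must contain a whole number of mel frames
assert cfg["batch_max_steps"] % cfg["n_shift"] == 0, "batch_max_steps % n_shift != 0"
print("mel frames per training clip:", cfg["batch_max_steps"] // cfg["n_shift"])
```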
# This is the configuration file for CSMSC dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256-shift setting.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# GENERATOR NETWORK ARCHITECTURE SETTING #
###########################################################
generator_params:
in_channels: 80 # Number of input channels.
out_channels: 1 # Number of output channels.
channels: 512 # Number of initial channels.
kernel_size: 7 # Kernel size of initial and final conv layers.
upsample_scales: [5, 5, 4, 3] # Upsampling scales.
upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks.
resblock_dilations: # Dilations for residual blocks.
- [1, 3, 5]
- [1, 3, 5]
- [1, 3, 5]
use_additional_convs: true # Whether to use additional conv layer in residual blocks.
bias: true # Whether to use bias parameter in conv.
nonlinear_activation: "leakyrelu" # Nonlinear activation type.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: true # Whether to apply weight normalization.
###########################################################
# DISCRIMINATOR NETWORK ARCHITECTURE SETTING #
###########################################################
discriminator_params:
scales: 3 # Number of multi-scale discriminator.
scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator.
scale_downsample_pooling_params:
kernel_size: 4 # Pooling kernel size.
stride: 2 # Pooling stride.
padding: 2 # Padding size.
scale_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [15, 41, 5, 3] # List of kernel sizes.
channels: 128 # Initial number of channels.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
max_groups: 16 # Maximum number of groups in downsampling conv layers.
bias: true
downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params:
negative_slope: 0.1
follow_official_norm: true # Whether to follow the official norm setting.
periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator.
period_discriminator_params:
in_channels: 1 # Number of input channels.
out_channels: 1 # Number of output channels.
kernel_sizes: [5, 3] # List of kernel sizes.
channels: 32 # Initial number of channels.
downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers.
bias: true # Whether to use bias parameter in conv layer.
nonlinear_activation: "leakyrelu" # Nonlinear activation.
nonlinear_activation_params: # Nonlinear activation parameters.
negative_slope: 0.1
use_weight_norm: true # Whether to apply weight normalization.
use_spectral_norm: false # Whether to apply spectral normalization.
###########################################################
# STFT LOSS SETTING #
###########################################################
use_stft_loss: false # Whether to use multi-resolution STFT loss.
use_mel_loss: true # Whether to use Mel-spectrogram loss.
mel_loss_params:
fs: 24000
fft_size: 2048
hop_size: 300
win_length: 1200
window: "hann"
num_mels: 80
fmin: 0
fmax: 12000
log_base: null
generator_adv_loss_params:
average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
average_by_discriminators: false # Whether to average loss by #discriminators.
average_by_layers: false # Whether to average loss by #layers in each discriminator.
include_final_outputs: false # Whether to include final outputs in feat match loss calculation.
###########################################################
# ADVERSARIAL LOSS SETTING #
###########################################################
lambda_aux: 45.0 # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
###########################################################
# DATA LOADER SETTING #
###########################################################
batch_size: 16 # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2 # Number of workers in DataLoader.
###########################################################
# OPTIMIZER & SCHEDULER SETTING #
###########################################################
generator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Generator's weight decay coefficient.
generator_scheduler_params:
learning_rate: 2.0e-4 # Generator's learning rate.
gamma: 0.5 # Generator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
generator_grad_norm: -1 # Generator's gradient norm.
discriminator_optimizer_params:
beta1: 0.5
beta2: 0.9
weight_decay: 0.0 # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
learning_rate: 2.0e-4 # Discriminator's learning rate.
gamma: 0.5 # Discriminator's scheduler gamma.
milestones: # At each milestone, lr will be multiplied by gamma.
- 200000
- 400000
- 600000
- 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.
###########################################################
# INTERVAL SETTING #
###########################################################
generator_train_start_steps: 1 # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000 # Number of training steps.
save_interval_steps: 10000 # Interval steps to save checkpoint.
eval_interval_steps: 1000 # Interval steps to evaluate the network.
log_interval_steps: 100 # Interval steps to record the training log.
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
#!/bin/bash
source path.sh
gpus=0
stage=0
stop_stage=100
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${MAIN_ROOT}/paddlespeech/t2s/exps/fastspeech2/gen_gta_mel.py \
--fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
--fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
--fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
--dur-file=durations.txt \
--output-dir=dump_finetune \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 local/link_wav.py \
--old-dump-dir=dump \
--dump-dir=dump_finetune
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# reuse features' stats (mean and std) computed in the training step
echo "Copy features' stats from the training dump dir ..."
cp dump/train/feats_stats.npy dump_finetune/train/
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/train/raw/metadata.jsonl \
--dumpdir=dump_finetune/train/norm \
--stats=dump_finetune/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/dev/raw/metadata.jsonl \
--dumpdir=dump_finetune/dev/norm \
--stats=dump_finetune/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump_finetune/test/raw/metadata.jsonl \
--dumpdir=dump_finetune/test/norm \
--stats=dump_finetune/train/feats_stats.npy
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} \
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump_finetune/train/norm/metadata.jsonl \
--dev-metadata=dump_finetune/dev/norm/metadata.jsonl \
--config=conf/finetune.yaml \
--output-dir=exp/finetune \
--ngpu=1
fi
\ No newline at end of file
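Because `finetune.sh` defines `stage`/`stop_stage` before sourcing `parse_options.sh`, you can select stages from the command line just like with `run.sh`. For example (assuming `durations.txt` and the pretrained `fastspeech2_nosil_baker_ckpt_0.4` files are already in place):
```bash
# generate GTA mels, link waves, copy stats and normalize (stages 0-3)
./finetune.sh --stage 0 --stop-stage 3
# then run only the finetuning stage
./finetune.sh --stage 4 --stop-stage 4
```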
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from operator import itemgetter
from pathlib import Path
import jsonlines
import numpy as np
def main():
    # parse config and args
    parser = argparse.ArgumentParser(
        description="Link wave files from the original dump dir and "
        "regenerate metadata for vocoder finetuning.")
    parser.add_argument(
        "--old-dump-dir",
        default=None,
        type=str,
        help="directory of the original dump feature files.")
    parser.add_argument(
        "--dump-dir",
        type=str,
        required=True,
        help="directory of the finetune dump feature files.")
    args = parser.parse_args()

    old_dump_dir = Path(args.old_dump_dir).expanduser()
    old_dump_dir = old_dump_dir.resolve()
    dump_dir = Path(args.dump_dir).expanduser()
    # use absolute path
    dump_dir = dump_dir.resolve()
    dump_dir.mkdir(parents=True, exist_ok=True)

    assert old_dump_dir.is_dir()
    assert dump_dir.is_dir()

    for sub in ["train", "dev", "test"]:
        # symlink the *_wave.npy files in old_dump_dir to the corresponding
        # locations under dump_dir and rebuild metadata.jsonl for the new mels
        output_dir = dump_dir / sub
        output_dir.mkdir(parents=True, exist_ok=True)
        results = []
        for name in os.listdir(output_dir / "raw"):
            # file names look like 003918_feats.npy
            utt_id = name.split("_")[0]
            mel_path = output_dir / ("raw/" + name)
            gen_mel = np.load(mel_path)
            wave_name = utt_id + "_wave.npy"
            wav = np.load(old_dump_dir / sub / ("raw/" + wave_name))
            os.symlink(old_dump_dir / sub / ("raw/" + wave_name),
                       output_dir / ("raw/" + wave_name))
            num_sample = wav.shape[0]
            num_frames = gen_mel.shape[0]
            wav_path = output_dir / ("raw/" + wave_name)
            record = {
                "utt_id": utt_id,
                "num_samples": num_sample,
                "num_frames": num_frames,
                "feats": str(mel_path),
                "wave": str(wav_path),
            }
            results.append(record)
        results.sort(key=itemgetter("utt_id"))
        with jsonlines.open(output_dir / "raw/metadata.jsonl", 'w') as writer:
            for item in results:
                writer.write(item)


if __name__ == "__main__":
    main()
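`finetune.sh` stage 1 invokes this script as shown below: `--old-dump-dir` points to the dump dir of the original training run, and `--dump-dir` to the finetuning dump dir that already contains the ground-truth-aligned mels.
```bash
python3 local/link_wav.py \
    --old-dump-dir=dump \
    --dump-dir=dump_finetune
```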
#!/bin/bash
stage=0
stop_stage=100
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./baker_alignment_tone \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/../preprocess.py \
--rootdir=~/datasets/BZNSYP/ \
--dataset=baker \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--cut-sil=True \
--num-cpu=20
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="feats"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize, dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--stats=dump/train/feats_stats.npy
python3 ${BIN_DIR}/../normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--stats=dump/train/feats_stats.npy
fi
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \
--generator-type=hifigan
#!/bin/bash
config_path=$1
train_output_path=$2
FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}
\ No newline at end of file
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_50000.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# synthesize
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
...@@ -30,4 +30,4 @@ train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10 ...@@ -30,4 +30,4 @@ train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention | 6.382194232940674 | 0.049661 | | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention | 6.382194232940674 | 0.049661 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.382194232940674 | 0.049566 | | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.382194232940674 | 0.049566 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.382194232940674 | 0.049585 | | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.382194232940674 | 0.049585 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.382194232940674 | 0.038135 | | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.382194232940674 | 0.038135 |
# Tacotron2 with LJSpeech # Tacotron2 with LJSpeech
PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884). PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from the text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884).
## Dataset ## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
...@@ -18,7 +18,7 @@ Run the command below to ...@@ -18,7 +18,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -40,7 +40,7 @@ optional arguments: ...@@ -40,7 +40,7 @@ optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config FILE path of the config file to overwrite to default config --config FILE path of the config file to overwrite to default config
with. with.
--data DATA_DIR path to the datatset. --data DATA_DIR path to the dataset.
--output OUTPUT_DIR path to save checkpoint and logs. --output OUTPUT_DIR path to save checkpoint and logs.
--checkpoint_path CHECKPOINT_PATH --checkpoint_path CHECKPOINT_PATH
path of the checkpoint to load path of the checkpoint to load
...@@ -50,9 +50,9 @@ optional arguments: ...@@ -50,9 +50,9 @@ optional arguments:
``` ```
If you want to train on CPU, just set `--ngpu=0`. If you want to train on CPU, just set `--ngpu=0`.
If you want to train on multiple GPUs, just set `--ngpu` as num of GPU. If you want to train on multiple GPUs, just set `--ngpu` as the num of GPU.
By default, training will be resumed from the latest checkpoint in `--output`, if you want to start a new training, please use a new `${OUTPUTPATH}` with no checkpoint. By default, training will be resumed from the latest checkpoint in `--output`, if you want to start a new training, please use a new `${OUTPUTPATH}` with no checkpoint.
And if you want to resume from an other existing model, you should set `checkpoint_path` to be the checkpoint path you want to load. And if you want to resume from another existing model, you should set `checkpoint_path` to be the checkpoint path you want to load.
**Note: The checkpoint path cannot contain the file extension.** **Note: The checkpoint path cannot contain the file extension.**
### Synthesizing ### Synthesizing
...@@ -79,11 +79,11 @@ optional arguments: ...@@ -79,11 +79,11 @@ optional arguments:
config, passing in KEY VALUE pairs config, passing in KEY VALUE pairs
-v, --verbose print msg -v, --verbose print msg
``` ```
**Ps.** You can use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels to wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example) **Ps.** You can use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels to wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example)
## Pretrained Models ## Pretrained Models
Pretrained Models can be downloaded from links below. We provide 2 models with different configurations. Pretrained Models can be downloaded from the links below. We provide 2 models with different configurations.
1. This model use a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip) 1. This model uses a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip)
2. This model does not have a stop token predictor. It uses the attention peak position to decided whether all the contents have been uttered. Also guided attention loss is used to speed up training. This model is trained with `configs/alternative.yaml`.[tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3_alternative.zip) 2. This model does not have a stop token predictor. It uses the attention peak position to decide whether all the contents have been uttered. Also, guided attention loss is used to speed up training. This model is trained with `configs/alternative.yaml`.[tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
...@@ -18,7 +18,7 @@ Run the command below to ...@@ -18,7 +18,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -42,9 +42,9 @@ dump ...@@ -42,9 +42,9 @@ dump
├── raw ├── raw
└── speech_stats.npy └── speech_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech feature of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/speech_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains the speech feature of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/speech_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, path of speech features, speaker and id of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, the path of speech features, speaker, and id of each utterance.
### Model Training ### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`. `./local/train.sh` calls `${BIN_DIR}/train.py`.
...@@ -75,7 +75,7 @@ optional arguments: ...@@ -75,7 +75,7 @@ optional arguments:
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file. 5. `--phones-dict` is the path of the phone vocabulary file.
...@@ -85,7 +85,7 @@ Download Pretrained WaveFlow Model with residual channel equals 128 from [wavefl ...@@ -85,7 +85,7 @@ Download Pretrained WaveFlow Model with residual channel equals 128 from [wavefl
```bash ```bash
unzip waveflow_ljspeech_ckpt_0.3.zip unzip waveflow_ljspeech_ckpt_0.3.zip
``` ```
WaveFlow checkpoint contains files listed below. WaveFlow checkpoint contains files listed below.
```text ```text
waveflow_ljspeech_ckpt_0.3 waveflow_ljspeech_ckpt_0.3
├── config.yaml # default config used to train waveflow ├── config.yaml # default config used to train waveflow
......
fs : 22050 # Hz, sample rate fs : 22050 # Hz, sample rate
n_fft : 1024 # fft frame size n_fft : 1024 # FFT size (samples).
win_length : 1024 # window size win_length : 1024 # Window length (samples). 46.4ms
n_shift : 256 # hop size between ajacent frame n_shift : 256 # Hop size (samples). 11.6ms
fmin : 0 # Hz, min frequency when converting to mel fmin : 0 # Hz, min frequency when converting to mel
fmax : 8000 # Hz, max frequency when converting to mel fmax : 8000 # Hz, max frequency when converting to mel
n_mels : 80 # mel bands n_mels : 80 # mel bands
......
...@@ -7,11 +7,11 @@ ckpt_name=$3 ...@@ -7,11 +7,11 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \ python3 ${BIN_DIR}/synthesize.py \
--transformer-tts-config=${config_path} \ --transformer-tts-config=${config_path} \
--transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--transformer-tts-stat=dump/train/speech_stats.npy \ --transformer-tts-stat=dump/train/speech_stats.npy \
--waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \ --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
--waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \ --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
--test-metadata=dump/test/norm/metadata.jsonl \ --test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \ --output-dir=${train_output_path}/test \
--phones-dict=dump/phone_id_map.txt --phones-dict=dump/phone_id_map.txt
...@@ -7,11 +7,11 @@ ckpt_name=$3 ...@@ -7,11 +7,11 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--transformer-tts-config=${config_path} \ --transformer-tts-config=${config_path} \
--transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--transformer-tts-stat=dump/train/speech_stats.npy \ --transformer-tts-stat=dump/train/speech_stats.npy \
--waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \ --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
--waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \ --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
--text=${BIN_DIR}/../sentences_en.txt \ --text=${BIN_DIR}/../sentences_en.txt \
--output-dir=${train_output_path}/test_e2e \ --output-dir=${train_output_path}/test_e2e \
--phones-dict=dump/phone_id_map.txt --phones-dict=dump/phone_id_map.txt
...@@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech ...@@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started ## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
...@@ -22,7 +22,7 @@ Run the command below to ...@@ -22,7 +22,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -49,9 +49,9 @@ dump ...@@ -49,9 +49,9 @@ dump
├── raw ├── raw
└── speech_stats.npy └── speech_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch, and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance.
### Model Training ### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`. `./local/train.sh` calls `${BIN_DIR}/train.py`.
...@@ -85,7 +85,7 @@ optional arguments: ...@@ -85,7 +85,7 @@ optional arguments:
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file. 5. `--phones-dict` is the path of the phone vocabulary file.
...@@ -102,97 +102,115 @@ pwg_ljspeech_ckpt_0.5 ...@@ -102,97 +102,115 @@ pwg_ljspeech_ckpt_0.5
├── pwg_snapshot_iter_400000.pdz # generator parameters of parallel wavegan ├── pwg_snapshot_iter_400000.pdz # generator parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
``` ```
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] usage: synthesize.py [-h]
[--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--fastspeech2-stat FASTSPEECH2_STAT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--pwg-config PWG_CONFIG] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--ngpu NGPU] [--verbose VERBOSE] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
Synthesize with fastspeech2 & parallel wavegan. [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--fastspeech2-config FASTSPEECH2_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
fastspeech2 config file. Choose acoustic model type of tts task.
--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT --am_config AM_CONFIG
fastspeech2 checkpoint to load. Config of acoustic model. Use default config when it is
--fastspeech2-stat FASTSPEECH2_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training fastspeech2. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
parallel wavegan config file. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training parallel wavegan.
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--speaker-dict SPEAKER_DICT --tones_dict TONES_DICT
speaker id map file for multiple speaker model. tone vocabulary file.
--test-metadata TEST_METADATA --speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata. test metadata.
--output-dir OUTPUT_DIR --output_dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose.
``` ```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e_en.py`, which can synthesize waveform from text file. `./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize_e2e.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] usage: synthesize_e2e.py [-h]
[--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--fastspeech2-stat FASTSPEECH2_STAT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--pwg-config PWG_CONFIG] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--tones_dict TONES_DICT]
[--pwg-stat PWG_STAT] [--phones-dict PHONES_DICT] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--text TEXT] [--output-dir OUTPUT_DIR] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--inference-dir INFERENCE_DIR] [--ngpu NGPU] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--verbose VERBOSE] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU]
Synthesize with fastspeech2 & parallel wavegan. [--text TEXT] [--output_dir OUTPUT_DIR]
Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--fastspeech2-config FASTSPEECH2_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
fastspeech2 config file. Choose acoustic model type of tts task.
--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT --am_config AM_CONFIG
fastspeech2 checkpoint to load. Config of acoustic model. Use default config when it is
--fastspeech2-stat FASTSPEECH2_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training fastspeech2. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
parallel wavegan config file. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training parallel wavegan.
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line. --tones_dict TONES_DICT
--output-dir OUTPUT_DIR tone vocabulary file.
output dir. --speaker_dict SPEAKER_DICT
--inference-dir INFERENCE_DIR speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG Choose model language. zh or en
--inference_dir INFERENCE_DIR
dir to save inference models dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu. --ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose. --text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output_dir OUTPUT_DIR
output dir.
``` ```
1. `--am` is the acoustic model type with the format {model_name}_{dataset} (see the sketch after this list).
1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat` and `--phones-dict` are arguments for fastspeech2, which correspond to the 4 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. 3. `--voc` is the vocoder type with the format {model_name}_{dataset}.
3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
4. `--text` is the text file, which contains sentences to synthesize. 5. `--lang` is the model language, which can be `zh` or `en`.
5. `--output-dir` is the directory to save synthesized audio files. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
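The `{model_name}_{dataset}` values accepted by `--am` and `--voc` can be split on the last underscore to recover the two parts (the last underscore matters because names like `mb_melgan` themselves contain one). The following is only an illustration of the naming convention, not the repo's actual argument handling:
```python
# Illustrative only: split a {model_name}_{dataset} tag such as those accepted by --am/--voc.
def split_model_tag(tag: str):
    model_name, dataset = tag.rsplit("_", 1)
    return model_name, dataset

print(split_model_tag("fastspeech2_ljspeech"))  # ('fastspeech2', 'ljspeech')
print(split_model_tag("mb_melgan_csmsc"))       # ('mb_melgan', 'csmsc')
```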
## Pretrained Model ## Pretrained Model
Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
...@@ -216,14 +234,18 @@ source path.sh ...@@ -216,14 +234,18 @@ source path.sh
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e_en.py \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--fastspeech2-config=fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ --am=fastspeech2_ljspeech \
--fastspeech2-checkpoint=fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ --am_config=fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \
--fastspeech2-stat=fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ --am_ckpt=fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \
--pwg-config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ --am_stat=fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \
--pwg-checkpoint=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_ljspeech\
--pwg-stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \ --text=${BIN_DIR}/../sentences_en.txt \
--output-dir=exp/default/test_e2e \ --output_dir=exp/default/test_e2e \
--phones-dict=fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt --inference_dir=exp/default/inference \
--phones_dict=fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt
``` ```
...@@ -3,9 +3,9 @@ ...@@ -3,9 +3,9 @@
########################################################### ###########################################################
fs: 22050 # sr fs: 22050 # sr
n_fft: 1024 # FFT size. n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size. n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length. win_length: null # Window length (samples).
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
......
...@@ -6,13 +6,15 @@ ckpt_name=$3 ...@@ -6,13 +6,15 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--fastspeech2-config=${config_path} \ --am=fastspeech2_ljspeech \
--fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --am_config=${config_path} \
--fastspeech2-stat=dump/train/speech_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--pwg-config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ --am_stat=dump/train/speech_stats.npy \
--pwg-checkpoint=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_ljspeech \
--pwg-stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--test-metadata=dump/test/norm/metadata.jsonl \ --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--output-dir=${train_output_path}/test \ --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--phones-dict=dump/phone_id_map.txt --test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
...@@ -6,13 +6,17 @@ ckpt_name=$3 ...@@ -6,13 +6,17 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e_en.py \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--fastspeech2-config=${config_path} \ --am=fastspeech2_ljspeech \
--fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --am_config=${config_path} \
--fastspeech2-stat=dump/train/speech_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--pwg-config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ --am_stat=dump/train/speech_stats.npy \
--pwg-checkpoint=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ --voc=pwgan_ljspeech \
--pwg-stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--text=${BIN_DIR}/../sentences_en.txt \ --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--output-dir=${train_output_path}/test_e2e \ --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--phones-dict=dump/phone_id_map.txt --lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt
\ No newline at end of file
...@@ -17,7 +17,7 @@ Run the command below to ...@@ -17,7 +17,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -45,7 +45,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_out ...@@ -45,7 +45,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_out
Synthesize waveform. Synthesize waveform.
1. We assume the `--input` is a directory containing several mel spectrograms(log magnitude) in `.npy` format. 1. We assume the `--input` is a directory containing several mel spectrograms (log magnitude) in `.npy` format (see the sketch below).
2. The output would be saved in `--output` directory, containing several `.wav` files, each with the same name as the mel spectrogram does. 2. The output would be saved in the `--output` directory, containing several `.wav` files, each with the same name as the corresponding mel spectrogram.
3. `--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extention name `.pdparmas` is not included here. 3. `--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension name `.pdparams` is not included here.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
......
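For reference, a minimal sketch of how such an `--input` directory could be prepared: one `.npy` file per utterance, named after the utterance id. The array contents below are random placeholders; in practice the files would hold log-magnitude mel spectrograms extracted by this recipe.
```python
import os

import numpy as np

# Placeholder log-mel arrays of shape (frames, n_mels); real features come from preprocessing.
mels = {
    "utt_001": np.random.randn(200, 80).astype("float32"),
    "utt_002": np.random.randn(150, 80).astype("float32"),
}
os.makedirs("mels", exist_ok=True)
for utt_id, mel in mels.items():
    np.save(os.path.join("mels", f"{utt_id}.npy"), mel)  # -> mels/utt_001.npy, mels/utt_002.npy
```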
...@@ -4,8 +4,8 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a ...@@ -4,8 +4,8 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a
### Download and Extract ### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio.
You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started ## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
...@@ -19,7 +19,7 @@ Run the command below to ...@@ -19,7 +19,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -43,9 +43,9 @@ dump ...@@ -43,9 +43,9 @@ dump
└── feats_stats.npy └── feats_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training ### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`. `./local/train.sh` calls `${BIN_DIR}/train.py`.
...@@ -91,7 +91,7 @@ benchmark: ...@@ -91,7 +91,7 @@ benchmark:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing ### Synthesizing
...@@ -100,15 +100,19 @@ benchmark: ...@@ -100,15 +100,19 @@ benchmark:
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--ngpu NGPU] [--verbose VERBOSE] [--output-dir OUTPUT_DIR] [--ngpu NGPU]
[--verbose VERBOSE]
Synthesize with parallel wavegan. Synthesize with GANVocoder.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG parallel wavegan config file. --generator-type GENERATOR_TYPE
type of GANVocoder, should be in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT --checkpoint CHECKPOINT
snapshot to load. snapshot to load.
--test-metadata TEST_METADATA --test-metadata TEST_METADATA
......
...@@ -7,9 +7,9 @@ ...@@ -7,9 +7,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 22050 # Sampling rate. fs: 22050 # Sampling rate.
n_fft: 1024 # FFT size. (in samples) n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size. (in samples) n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length. (in samples) win_length: null # Window length (samples).
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
...@@ -49,9 +49,9 @@ discriminator_params: ...@@ -49,9 +49,9 @@ discriminator_params:
bias: true # Whether to use bias parameter in conv. bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm. use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers. # If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU. negative_slope: 0.2 # Alpha in leakyrelu.
########################################################### ###########################################################
# STFT LOSS SETTING # # STFT LOSS SETTING #
......
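The discriminator's `nonlinear_activation: "leakyrelu"` with `negative_slope: 0.2` above describes a standard leaky ReLU; presumably the model code maps this name to the corresponding `paddle.nn` layer. A minimal sketch of just that activation (not the repo's name-resolution logic):
```python
import paddle

# The activation described by the discriminator config: leaky ReLU with slope 0.2.
act = paddle.nn.LeakyReLU(negative_slope=0.2)
x = paddle.to_tensor([-1.0, 0.0, 0.5])
print(act(x).numpy())  # [-0.2  0.   0.5]
```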
...@@ -7,8 +7,8 @@ ckpt_name=$3 ...@@ -7,8 +7,8 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \ --config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \ --test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \ --output-dir=${train_output_path}/test \
--generator-type=pwgan --generator-type=pwgan
...@@ -2,14 +2,14 @@ ...@@ -2,14 +2,14 @@
This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443). This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443).
## Dataset ## Dataset
### Download and Extract the datasaet ### Download and Extract the dataset
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443). Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443).
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because no txt for it. 1. `p315`, because there is no text for it.
2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them.
## Get Started ## Get Started
...@@ -25,7 +25,7 @@ Run the command below to ...@@ -25,7 +25,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -52,9 +52,9 @@ dump ...@@ -52,9 +52,9 @@ dump
├── raw ├── raw
└── speech_stats.npy └── speech_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch, and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance.
### Model Training ### Model Training
```bash ```bash
...@@ -88,7 +88,7 @@ optional arguments: ...@@ -88,7 +88,7 @@ optional arguments:
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--phones-dict` is the path of the phone vocabulary file. 4. `--phones-dict` is the path of the phone vocabulary file.
### Synthesizing ### Synthesizing
...@@ -105,99 +105,115 @@ pwg_vctk_ckpt_0.5 ...@@ -105,99 +105,115 @@ pwg_vctk_ckpt_0.5
├── pwg_snapshot_iter_1000000.pdz # generator parameters of parallel wavegan ├── pwg_snapshot_iter_1000000.pdz # generator parameters of parallel wavegan
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
``` ```
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] usage: synthesize.py [-h]
[--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--fastspeech2-stat FASTSPEECH2_STAT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--pwg-config PWG_CONFIG] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--ngpu NGPU] [--verbose VERBOSE] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU]
Synthesize with fastspeech2 & parallel wavegan. [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--fastspeech2-config FASTSPEECH2_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
fastspeech2 config file. Choose acoustic model type of tts task.
--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT --am_config AM_CONFIG
fastspeech2 checkpoint to load. Config of acoustic model. Use default config when it is
--fastspeech2-stat FASTSPEECH2_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training fastspeech2. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
parallel wavegan config file. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training parallel wavegan.
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--speaker-dict SPEAKER_DICT --tones_dict TONES_DICT
speaker id map file for multiple speaker model. tone vocabulary file.
--test-metadata TEST_METADATA --speaker_dict SPEAKER_DICT
speaker id map file.
--voice-cloning VOICE_CLONING
whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--ngpu NGPU if ngpu == 0, use cpu.
--test_metadata TEST_METADATA
test metadata. test metadata.
--output-dir OUTPUT_DIR --output_dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose.
``` ```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/multi_spk_synthesize_e2e_en.py`, which can synthesize waveform from text file. `./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
```bash ```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: multi_spk_synthesize_e2e_en.py [-h] usage: synthesize_e2e.py [-h]
[--fastspeech2-config FASTSPEECH2_CONFIG] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}]
[--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--fastspeech2-stat FASTSPEECH2_STAT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--pwg-config PWG_CONFIG] [--tones_dict TONES_DICT]
[--pwg-checkpoint PWG_CHECKPOINT] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--pwg-stat PWG_STAT] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
[--phones-dict PHONES_DICT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--speaker-dict SPEAKER_DICT] [--voc_stat VOC_STAT] [--lang LANG]
[--text TEXT] [--output-dir OUTPUT_DIR] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
[--ngpu NGPU] [--verbose VERBOSE] [--text TEXT] [--output_dir OUTPUT_DIR]
Synthesize with fastspeech2 & parallel wavegan. Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--fastspeech2-config FASTSPEECH2_CONFIG --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}
fastspeech2 config file. Choose acoustic model type of tts task.
--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT --am_config AM_CONFIG
fastspeech2 checkpoint to load. Config of acoustic model. Use default config when it is
--fastspeech2-stat FASTSPEECH2_STAT None.
mean and standard deviation used to normalize --am_ckpt AM_CKPT Checkpoint file of acoustic model.
spectrogram when training fastspeech2. --am_stat AM_STAT mean and standard deviation used to normalize
--pwg-config PWG_CONFIG spectrogram when training acoustic model.
parallel wavegan config file. --phones_dict PHONES_DICT
--pwg-checkpoint PWG_CHECKPOINT
parallel wavegan generator parameters to load.
--pwg-stat PWG_STAT mean and standard deviation used to normalize
spectrogram when training parallel wavegan.
--phones-dict PHONES_DICT
phone vocabulary file. phone vocabulary file.
--speaker-dict SPEAKER_DICT --tones_dict TONES_DICT
tone vocabulary file.
--speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
Choose vocoder type of tts task.
--voc_config VOC_CONFIG
Config of voc. Use deault config when it is None.
--voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG Choose model language. zh or en
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line. --text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output-dir OUTPUT_DIR --output_dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu.
--verbose VERBOSE verbose.
``` ```
1. `--am` is the acoustic model type with the format {model_name}_{dataset}.
1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat` and `--phones-dict` are arguments for fastspeech2, which correspond to the 4 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict`, `--speaker_dict` are arguments for the acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. 3. `--voc` is the vocoder type with the format {model_name}_{dataset}.
3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
4. `--text` is the text file, which contains sentences to synthesize. 5. `--lang` is the model language, which can be `zh` or `en`.
5. `--output-dir` is the directory to save synthesized audio files. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model ## Pretrained Model
Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip) Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
...@@ -217,15 +233,19 @@ source path.sh ...@@ -217,15 +233,19 @@ source path.sh
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/multi_spk_synthesize_e2e_en.py \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--fastspeech2-config=fastspeech2_nosil_vctk_ckpt_0.5/default.yaml \ --am=fastspeech2_vctk \
--fastspeech2-checkpoint=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \ --am_config=fastspeech2_nosil_vctk_ckpt_0.5/default.yaml \
--fastspeech2-stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \ --am_ckpt=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \
--pwg-config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ --am_stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \
--pwg-checkpoint=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ --voc=pwgan_vctk \
--pwg-stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
--voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \ --text=${BIN_DIR}/../sentences_en.txt \
--output-dir=exp/default/test_e2e \ --output_dir=exp/default/test_e2e \
--phones-dict=fastspeech2_nosil_vctk_ckpt_0.5/phone_id_map.txt \ --phones_dict=dump/phone_id_map.txt \
--speaker-dict=fastspeech2_nosil_vctk_ckpt_0.5/speaker_id_map.txt --speaker_dict=dump/speaker_id_map.txt \
--spk_id=0
``` ```
...@@ -3,9 +3,9 @@ ...@@ -3,9 +3,9 @@
########################################################### ###########################################################
fs: 24000 # sr fs: 24000 # sr
n_fft: 2048 # FFT size. n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
......
...@@ -6,14 +6,16 @@ ckpt_name=$3 ...@@ -6,14 +6,16 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--fastspeech2-config=${config_path} \ --am=fastspeech2_vctk \
--fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --am_config=${config_path} \
--fastspeech2-stat=dump/train/speech_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--pwg-config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ --am_stat=dump/train/speech_stats.npy \
--pwg-checkpoint=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ --voc=pwgan_vctk \
--pwg-stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
--test-metadata=dump/test/norm/metadata.jsonl \ --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
--output-dir=${train_output_path}/test \ --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
--phones-dict=dump/phone_id_map.txt \ --test_metadata=dump/test/norm/metadata.jsonl \
--speaker-dict=dump/speaker_id_map.txt --output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt
...@@ -6,14 +6,18 @@ ckpt_name=$3 ...@@ -6,14 +6,18 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/multi_spk_synthesize_e2e_en.py \ python3 ${BIN_DIR}/../synthesize_e2e.py \
--fastspeech2-config=${config_path} \ --am=fastspeech2_vctk \
--fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --am_config=${config_path} \
--fastspeech2-stat=dump/train/speech_stats.npy \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--pwg-config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ --am_stat=dump/train/speech_stats.npy \
--pwg-checkpoint=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ --voc=pwgan_vctk \
--pwg-stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
--text=${BIN_DIR}/../sentences_en.txt \ --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
--output-dir=${train_output_path}/test_e2e \ --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
--phones-dict=dump/phone_id_map.txt \ --lang=en \
--speaker-dict=dump/speaker_id_map.txt --text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0
...@@ -6,10 +6,10 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a ...@@ -6,10 +6,10 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`. Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA Result and Extract ### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio.
You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because no txt for it. 1. `p315`, because there is no text for it.
2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them.
## Get Started ## Get Started
...@@ -24,7 +24,7 @@ Run the command below to ...@@ -24,7 +24,7 @@ Run the command below to
```bash ```bash
./run.sh ./run.sh
``` ```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash ```bash
./run.sh --stage 0 --stop-stage 0 ./run.sh --stage 0 --stop-stage 0
``` ```
...@@ -48,9 +48,9 @@ dump ...@@ -48,9 +48,9 @@ dump
└── feats_stats.npy └── feats_stats.npy
``` ```
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
### Model Training ### Model Training
```bash ```bash
...@@ -96,7 +96,7 @@ benchmark: ...@@ -96,7 +96,7 @@ benchmark:
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Synthesizing ### Synthesizing
...@@ -105,15 +105,19 @@ benchmark: ...@@ -105,15 +105,19 @@ benchmark:
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
``` ```
```text ```text
usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
[--ngpu NGPU] [--verbose VERBOSE] [--output-dir OUTPUT_DIR] [--ngpu NGPU]
[--verbose VERBOSE]
Synthesize with parallel wavegan. Synthesize with GANVocoder.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG parallel wavegan config file. --generator-type GENERATOR_TYPE
type of GANVocoder, should be in {pwgan, mb_melgan,
style_melgan, } now
--config CONFIG GANVocoder config file.
--checkpoint CHECKPOINT --checkpoint CHECKPOINT
snapshot to load. snapshot to load.
--test-metadata TEST_METADATA --test-metadata TEST_METADATA
......
...@@ -7,9 +7,9 @@ ...@@ -7,9 +7,9 @@
# FEATURE EXTRACTION SETTING # # FEATURE EXTRACTION SETTING #
########################################################### ###########################################################
fs: 24000 # Sampling rate. fs: 24000 # Sampling rate.
n_fft: 2048 # FFT size. (in samples) n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size. (in samples) n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length. (in samples) win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size. # If set to null, it will be the same as fft_size.
window: "hann" # Window function. window: "hann" # Window function.
n_mels: 80 # Number of mel basis. n_mels: 80 # Number of mel basis.
...@@ -49,9 +49,9 @@ discriminator_params: ...@@ -49,9 +49,9 @@ discriminator_params:
bias: true # Whether to use bias parameter in conv. bias: true # Whether to use bias parameter in conv.
use_weight_norm: true # Whether to use weight norm. use_weight_norm: true # Whether to use weight norm.
# If set to true, it will be applied to all of the conv layers. # If set to true, it will be applied to all of the conv layers.
nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. nonlinear_activation: "leakyrelu" # Nonlinear function after each conv.
nonlinear_activation_params: # Nonlinear function parameters nonlinear_activation_params: # Nonlinear function parameters
negative_slope: 0.2 # Alpha in LeakyReLU. negative_slope: 0.2 # Alpha in leakyrelu.
########################################################### ###########################################################
# STFT LOSS SETTING # # STFT LOSS SETTING #
......
...@@ -7,8 +7,8 @@ ckpt_name=$3 ...@@ -7,8 +7,8 @@ ckpt_name=$3
FLAGS_allocator_strategy=naive_best_fit \ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \ python3 ${BIN_DIR}/../synthesize.py \
--config=${config_path} \ --config=${config_path} \
--checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
--test-metadata=dump/test/norm/metadata.jsonl \ --test-metadata=dump/test/norm/metadata.jsonl \
--output-dir=${train_output_path}/test \ --output-dir=${train_output_path}/test \
--generator-type=pwgan --generator-type=pwgan
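The `--generator-type=pwgan` flag above tells the shared `synthesize.py` which GAN vocoder generator to instantiate. Below is a hypothetical sketch of such a dispatch, limited to the two generator classes imported elsewhere in this PR; the actual lookup inside `synthesize.py` may be organized differently:
```python
# hypothetical sketch of a --generator-type dispatch; not the repo's exact code
from paddlespeech.t2s.models.melgan import MelGANGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator

GENERATORS = {
    "pwgan": PWGGenerator,
    "mb_melgan": MelGANGenerator,  # assumption: multi-band MelGAN reuses the MelGAN generator class
}


def build_generator(generator_type: str, config):
    # pick the generator class by name and build it from its config section
    generator_class = GENERATORS[generator_type]
    return generator_class(**config["generator_params"])
```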
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, fastspeech2_config, pwg_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
sentence = "".join(items[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id)
print("spk_num:", spk_num)
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size,
odim=odim,
spk_num=spk_num,
**fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
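# build the Parallel WaveGAN generator used as the vocoder; weight norm is only needed
# during training, so it is removed before inference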
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
frontend = Frontend(phone_vocab_path=args.phones_dict)
print("frontend done!")
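# load the mean/std saved during feature extraction and wrap them in ZScore normalizers:
# FastSpeech2Inference denormalizes the predicted mel with the fastspeech2 stats, and
# PWGInference normalizes that mel with the vocoder's own stats before waveform generation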
stat = np.load(args.fastspeech2_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
stat = np.load(args.pwg_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
pwg_inference = PWGInference(pwg_normalizer, vocoder)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
# synthesize the first two sentences for the first 20 speaker ids
spk_ids = list(range(20))
for spk_id in spk_ids:
for utt_id, sentence in sentences[:2]:
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
for part_phone_ids in phone_ids:
with paddle.no_grad():
mel = fastspeech2_inference(
part_phone_ids, spk_id=paddle.to_tensor(spk_id))
temp_wav = pwg_inference(mel)
if flags == 0:
wav = temp_wav
flags = 1
else:
wav = paddle.concat([wav, temp_wav])
sf.write(
str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")),
wav.numpy(),
samplerate=fastspeech2_config.fs)
print(f"{spk_id}_{utt_id} done!")
def main():
# parse args and config and redirect to evaluate
parser = argparse.ArgumentParser(
description="Synthesize with fastspeech2 & parallel wavegan.")
parser.add_argument(
"--fastspeech2-config", type=str, help="fastspeech2 config file.")
parser.add_argument(
"--fastspeech2-checkpoint",
type=str,
help="fastspeech2 checkpoint to load.")
parser.add_argument(
"--fastspeech2-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
)
parser.add_argument(
"--pwg-config", type=str, help="parallel wavegan config file.")
parser.add_argument(
"--pwg-checkpoint",
type=str,
help="parallel wavegan generator parameters to load.")
parser.add_argument(
"--pwg-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
)
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--speaker-dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should be >= 0 !")
with open(args.fastspeech2_config) as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
with open(args.pwg_config) as f:
pwg_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(fastspeech2_config)
print(pwg_config)
evaluate(args, fastspeech2_config, pwg_config)
if __name__ == "__main__":
main()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, fastspeech2_config, pwg_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
line_list = line.strip().split()
utt_id = line_list[0]
sentence = " ".join(line_list[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
phone_id_map = {}
for phn, id in phn_id:
phone_id_map[phn] = int(id)
print("vocab_size:", vocab_size)
with open(args.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id)
print("spk_num:", spk_num)
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size,
odim=odim,
spk_num=spk_num,
**fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
frontend = English(phone_vocab_path=args.phones_dict)
print("frontend done!")
stat = np.load(args.fastspeech2_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
stat = np.load(args.pwg_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
pwg_inference = PWGInference(pwg_normalizer, vocoder)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
# only test the number 0 speaker
spk_id = 0
for utt_id, sentence in sentences:
input_ids = frontend.get_input_ids(sentence)
phone_ids = input_ids["phone_ids"]
with paddle.no_grad():
mel = fastspeech2_inference(
phone_ids, spk_id=paddle.to_tensor(spk_id))
wav = pwg_inference(mel)
sf.write(
str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")),
wav.numpy(),
samplerate=fastspeech2_config.fs)
print(f"{spk_id}_{utt_id} done!")
def main():
# parse args and config and redirect to evaluate
parser = argparse.ArgumentParser(
description="Synthesize with fastspeech2 & parallel wavegan.")
parser.add_argument(
"--fastspeech2-config", type=str, help="fastspeech2 config file.")
parser.add_argument(
"--fastspeech2-checkpoint",
type=str,
help="fastspeech2 checkpoint to load.")
parser.add_argument(
"--fastspeech2-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
)
parser.add_argument(
"--pwg-config", type=str, help="parallel wavegan config file.")
parser.add_argument(
"--pwg-checkpoint",
type=str,
help="parallel wavegan generator parameters to load.")
parser.add_argument(
"--pwg-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
)
parser.add_argument(
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--speaker-dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should be >= 0 !")
with open(args.fastspeech2_config) as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
with open(args.pwg_config) as f:
pwg_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(fastspeech2_config)
print(pwg_config)
evaluate(args, fastspeech2_config, pwg_config)
if __name__ == "__main__":
main()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, fastspeech2_config, pwg_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
sentence = "".join(items[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
frontend = Frontend(phone_vocab_path=args.phones_dict)
print("frontend done!")
stat = np.load(args.fastspeech2_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
stat = np.load(args.pwg_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
fastspeech2_inference.eval()
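# convert the acoustic model to a static graph with paddle.jit (input: a 1-D int64 phone id
# sequence), save it to inference_dir, then reload the exported program for synthesis below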
fastspeech2_inference = jit.to_static(
fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
paddle.jit.save(fastspeech2_inference,
os.path.join(args.inference_dir, "fastspeech2"))
fastspeech2_inference = paddle.jit.load(
os.path.join(args.inference_dir, "fastspeech2"))
pwg_inference = PWGInference(pwg_normalizer, vocoder)
pwg_inference.eval()
pwg_inference = jit.to_static(
pwg_inference, input_spec=[
InputSpec([-1, 80], dtype=paddle.float32),
])
paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for utt_id, sentence in sentences:
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
for part_phone_ids in phone_ids:
with paddle.no_grad():
mel = fastspeech2_inference(part_phone_ids)
temp_wav = pwg_inference(mel)
if flags == 0:
wav = temp_wav
flags = 1
else:
wav = paddle.concat([wav, temp_wav])
sf.write(
str(output_dir / (utt_id + ".wav")),
wav.numpy(),
samplerate=fastspeech2_config.fs)
print(f"{utt_id} done!")
def main():
# parse args and config and redirect to evaluate
parser = argparse.ArgumentParser(
description="Synthesize with fastspeech2 & parallel wavegan.")
parser.add_argument(
"--fastspeech2-config", type=str, help="fastspeech2 config file.")
parser.add_argument(
"--fastspeech2-checkpoint",
type=str,
help="fastspeech2 checkpoint to load.")
parser.add_argument(
"--fastspeech2-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
)
parser.add_argument(
"--pwg-config", type=str, help="parallel wavegan config file.")
parser.add_argument(
"--pwg-checkpoint",
type=str,
help="parallel wavegan generator parameters to load.")
parser.add_argument(
"--pwg-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
)
parser.add_argument(
"--phones-dict",
type=str,
default="phone_id_map.txt",
help="phone vocabulary file.")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--inference-dir", type=str, help="dir to save inference models")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should be >= 0 !")
with open(args.fastspeech2_config) as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
with open(args.pwg_config) as f:
pwg_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(fastspeech2_config)
print(pwg_config)
evaluate(args, fastspeech2_config, pwg_config)
if __name__ == "__main__":
main()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, fastspeech2_config, pwg_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
line_list = line.strip().split()
utt_id = line_list[0]
sentence = " ".join(line_list[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
phone_id_map = {}
for phn, id in phn_id:
phone_id_map[phn] = int(id)
print("vocab_size:", vocab_size)
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
frontend = English(phone_vocab_path=args.phones_dict)
print("frontend done!")
stat = np.load(args.fastspeech2_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
stat = np.load(args.pwg_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
pwg_inference = PWGInference(pwg_normalizer, vocoder)
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for utt_id, sentence in sentences:
input_ids = frontend.get_input_ids(sentence)
phone_ids = input_ids["phone_ids"]
with paddle.no_grad():
mel = fastspeech2_inference(phone_ids)
wav = pwg_inference(mel)
sf.write(
str(output_dir / (utt_id + ".wav")),
wav.numpy(),
samplerate=fastspeech2_config.fs)
print(f"{utt_id} done!")
def main():
# parse args and config and redirect to evaluate
parser = argparse.ArgumentParser(
description="Synthesize with fastspeech2 & parallel wavegan.")
parser.add_argument(
"--fastspeech2-config", type=str, help="fastspeech2 config file.")
parser.add_argument(
"--fastspeech2-checkpoint",
type=str,
help="fastspeech2 checkpoint to load.")
parser.add_argument(
"--fastspeech2-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
)
parser.add_argument(
"--pwg-config", type=str, help="parallel wavegan config file.")
parser.add_argument(
"--pwg-checkpoint",
type=str,
help="parallel wavegan generator parameters to load.")
parser.add_argument(
"--pwg-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
)
parser.add_argument(
"--phones-dict",
type=str,
default="phone_id_map.txt",
help="phone vocabulary file.")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should >= 0 !")
with open(args.fastspeech2_config) as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
with open(args.pwg_config) as f:
pwg_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(fastspeech2_config)
print(pwg_config)
evaluate(args, fastspeech2_config, pwg_config)
if __name__ == "__main__":
main()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.melgan import MelGANGenerator
from paddlespeech.t2s.models.melgan import MelGANInference
from paddlespeech.t2s.modules.normalizer import ZScore
def evaluate(args, fastspeech2_config, melgan_config):
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
sentence = "".join(items[1:])
sentences.append((utt_id, sentence))
with open(args.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
print("vocab_size:", vocab_size)
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()
vocoder = MelGANGenerator(**melgan_config["generator_params"])
vocoder.set_state_dict(
paddle.load(args.melgan_checkpoint)["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()
print("model done!")
frontend = Frontend(phone_vocab_path=args.phones_dict)
print("frontend done!")
stat = np.load(args.fastspeech2_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
stat = np.load(args.melgan_stat)
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
fastspeech2_inference.eval()
fastspeech2_inference = jit.to_static(
fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
paddle.jit.save(fastspeech2_inference,
os.path.join(args.inference_dir, "fastspeech2"))
fastspeech2_inference = paddle.jit.load(
os.path.join(args.inference_dir, "fastspeech2"))
mb_melgan_inference = MelGANInference(pwg_normalizer, vocoder)
mb_melgan_inference.eval()
mb_melgan_inference = jit.to_static(
mb_melgan_inference,
input_spec=[
InputSpec([-1, 80], dtype=paddle.float32),
])
paddle.jit.save(mb_melgan_inference,
os.path.join(args.inference_dir, "mb_melgan"))
mb_melgan_inference = paddle.jit.load(
os.path.join(args.inference_dir, "mb_melgan"))
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
for utt_id, sentence in sentences:
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
for part_phone_ids in phone_ids:
with paddle.no_grad():
mel = fastspeech2_inference(part_phone_ids)
temp_wav = mb_melgan_inference(mel)
if flags == 0:
wav = temp_wav
flags = 1
else:
wav = paddle.concat([wav, temp_wav])
sf.write(
str(output_dir / (utt_id + ".wav")),
wav.numpy(),
samplerate=fastspeech2_config.fs)
print(f"{utt_id} done!")
def main():
# parse args and config and redirect to evaluate
parser = argparse.ArgumentParser(
description="Synthesize with fastspeech2 & multi band melgan.")
parser.add_argument(
"--fastspeech2-config", type=str, help="fastspeech2 config file.")
parser.add_argument(
"--fastspeech2-checkpoint",
type=str,
help="fastspeech2 checkpoint to load.")
parser.add_argument(
"--fastspeech2-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
)
parser.add_argument(
"--melgan-config", type=str, help="multi band melgan config file.")
parser.add_argument(
"--melgan-checkpoint",
type=str,
help="multi band melgan generator parameters to load.")
parser.add_argument(
"--melgan-stat",
type=str,
help="mean and standard deviation used to normalize spectrogram when training multi band melgan."
)
parser.add_argument(
"--phones-dict",
type=str,
default="phone_id_map.txt",
help="phone vocabulary file.")
parser.add_argument(
"--text",
type=str,
help="text to synthesize, a 'utt_id sentence' pair per line.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--inference-dir", type=str, help="dir to save inference models")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
if args.ngpu == 0:
paddle.set_device("cpu")
elif args.ngpu > 0:
paddle.set_device("gpu")
else:
print("ngpu should be >= 0 !")
with open(args.fastspeech2_config) as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
with open(args.melgan_config) as f:
melgan_config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(fastspeech2_config)
print(melgan_config)
evaluate(args, fastspeech2_config, melgan_config)
if __name__ == "__main__":
main()
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
import shutil
from pathlib import Path
import jsonlines
import numpy as np
import paddle
import yaml
from paddle import DataParallel
from paddle import distributed as dist
from paddle import nn
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from paddle.optimizer import Adam
from paddle.optimizer.lr import MultiStepDecay
from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.datasets.vocoder_batch_fn import Clip
from paddlespeech.t2s.models.hifigan import HiFiGANEvaluator
from paddlespeech.t2s.models.hifigan import HiFiGANGenerator
from paddlespeech.t2s.models.hifigan import HiFiGANMultiScaleMultiPeriodDiscriminator
from paddlespeech.t2s.models.hifigan import HiFiGANUpdater
from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss
from paddlespeech.t2s.modules.losses import FeatureMatchLoss
from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss
from paddlespeech.t2s.modules.losses import MelSpectrogramLoss
from paddlespeech.t2s.training.extensions.snapshot import Snapshot
from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.seeding import seed_everything
from paddlespeech.t2s.training.trainer import Trainer
def train_sp(args, config):
# decides device type and whether to run in parallel
# setup running environment correctly
world_size = paddle.distributed.get_world_size()
if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0:
paddle.set_device("cpu")
else:
paddle.set_device("gpu")
if world_size > 1:
paddle.distributed.init_parallel_env()
# set the random seed; it is required for multiprocess training
seed_everything(config.seed)
print(
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
)
# dataloader has been too verbose
logging.getLogger("DataLoader").disabled = True
# construct dataset for training and validation
with jsonlines.open(args.train_metadata, 'r') as reader:
train_metadata = list(reader)
train_dataset = DataTable(
data=train_metadata,
fields=["wave", "feats"],
converters={
"wave": np.load,
"feats": np.load,
}, )
with jsonlines.open(args.dev_metadata, 'r') as reader:
dev_metadata = list(reader)
dev_dataset = DataTable(
data=dev_metadata,
fields=["wave", "feats"],
converters={
"wave": np.load,
"feats": np.load,
}, )
# collate function and dataloader
train_sampler = DistributedBatchSampler(
train_dataset,
batch_size=config.batch_size,
shuffle=True,
drop_last=True)
dev_sampler = DistributedBatchSampler(
dev_dataset,
batch_size=config.batch_size,
shuffle=False,
drop_last=False)
print("samplers done!")
if "aux_context_window" in config.generator_params:
aux_context_window = config.generator_params.aux_context_window
else:
aux_context_window = 0
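# Clip crops aligned (waveform, mel) training segments of batch_max_steps samples
# (plus aux context frames when configured) so every batch item has the same length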
train_batch_fn = Clip(
batch_max_steps=config.batch_max_steps,
hop_size=config.n_shift,
aux_context_window=aux_context_window)
train_dataloader = DataLoader(
train_dataset,
batch_sampler=train_sampler,
collate_fn=train_batch_fn,
num_workers=config.num_workers)
dev_dataloader = DataLoader(
dev_dataset,
batch_sampler=dev_sampler,
collate_fn=train_batch_fn,
num_workers=config.num_workers)
print("dataloaders done!")
generator = HiFiGANGenerator(**config["generator_params"])
discriminator = HiFiGANMultiScaleMultiPeriodDiscriminator(
**config["discriminator_params"])
if world_size > 1:
generator = DataParallel(generator)
discriminator = DataParallel(discriminator)
print("models done!")
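# HiFiGAN training combines a mel-spectrogram loss, a feature-matching loss, and
# generator/discriminator adversarial losses; they are weighted by the lambda_* values
# passed to the updater below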
criterion_feat_match = FeatureMatchLoss(**config["feat_match_loss_params"])
criterion_mel = MelSpectrogramLoss(
fs=config.fs,
fft_size=config.n_fft,
hop_size=config.n_shift,
win_length=config.win_length,
window=config.window,
num_mels=config.n_mels,
fmin=config.fmin,
fmax=config.fmax, )
criterion_gen_adv = GeneratorAdversarialLoss(
**config["generator_adv_loss_params"])
criterion_dis_adv = DiscriminatorAdversarialLoss(
**config["discriminator_adv_loss_params"])
print("criterions done!")
lr_schedule_g = MultiStepDecay(**config["generator_scheduler_params"])
# compared with the multi_band_melgan.v1 config, the Adam optimizer is used without gradient
# norm clipping by default; clipping is only enabled when generator_grad_norm > 0
generator_grad_norm = config["generator_grad_norm"]
gradient_clip_g = nn.ClipGradByGlobalNorm(
generator_grad_norm) if generator_grad_norm > 0 else None
print("gradient_clip_g:", gradient_clip_g)
optimizer_g = Adam(
learning_rate=lr_schedule_g,
grad_clip=gradient_clip_g,
parameters=generator.parameters(),
**config["generator_optimizer_params"])
lr_schedule_d = MultiStepDecay(**config["discriminator_scheduler_params"])
discriminator_grad_norm = config["discriminator_grad_norm"]
gradient_clip_d = nn.ClipGradByGlobalNorm(
discriminator_grad_norm) if discriminator_grad_norm > 0 else None
print("gradient_clip_d:", gradient_clip_d)
optimizer_d = Adam(
learning_rate=lr_schedule_d,
grad_clip=gradient_clip_d,
parameters=discriminator.parameters(),
**config["discriminator_optimizer_params"])
print("optimizers done!")
output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
if dist.get_rank() == 0:
config_name = args.config.split("/")[-1]
# copy conf to output_dir
shutil.copyfile(args.config, output_dir / config_name)
updater = HiFiGANUpdater(
models={
"generator": generator,
"discriminator": discriminator,
},
optimizers={
"generator": optimizer_g,
"discriminator": optimizer_d,
},
criterions={
"mel": criterion_mel,
"feat_match": criterion_feat_match,
"gen_adv": criterion_gen_adv,
"dis_adv": criterion_dis_adv,
},
schedulers={
"generator": lr_schedule_g,
"discriminator": lr_schedule_d,
},
dataloader=train_dataloader,
discriminator_train_start_steps=config.discriminator_train_start_steps,
# only hifigan has generator_train_start_steps
generator_train_start_steps=config.generator_train_start_steps,
lambda_adv=config.lambda_adv,
lambda_aux=config.lambda_aux,
lambda_feat_match=config.lambda_feat_match,
output_dir=output_dir)
evaluator = HiFiGANEvaluator(
models={
"generator": generator,
"discriminator": discriminator,
},
criterions={
"mel": criterion_mel,
"feat_match": criterion_feat_match,
"gen_adv": criterion_gen_adv,
"dis_adv": criterion_dis_adv,
},
dataloader=dev_dataloader,
lambda_adv=config.lambda_adv,
lambda_aux=config.lambda_aux,
lambda_feat_match=config.lambda_feat_match,
output_dir=output_dir)
trainer = Trainer(
updater,
stop_trigger=(config.train_max_steps, "iteration"),
out=output_dir)
if dist.get_rank() == 0:
trainer.extend(
evaluator, trigger=(config.eval_interval_steps, 'iteration'))
trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
trainer.extend(
Snapshot(max_size=config.num_snapshots),
trigger=(config.save_interval_steps, 'iteration'))
print("Trainer Done!")
trainer.run()
def main():
# parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(
description="Train a HiFiGAN model.")
parser.add_argument(
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument(
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
args = parser.parse_args()
with open(args.config, 'rt') as f:
config = CfgNode(yaml.safe_load(f))
print("========Args========")
print(yaml.safe_dump(vars(args)))
print("========Config========")
print(config)
print(
f"master sees the world size: {dist.get_world_size()}, from pid: {os.getpid()}"
)
# dispatch
if args.ngpu > 1:
dist.spawn(train_sp, (args, config), nprocs=args.ngpu)
else:
train_sp(args, config)
if __name__ == "__main__":
main()
...@@ -84,6 +84,7 @@ def main(): ...@@ -84,6 +84,7 @@ def main():
generator.set_state_dict(state_dict["generator_params"]) generator.set_state_dict(state_dict["generator_params"])
generator.remove_weight_norm() generator.remove_weight_norm()
generator.eval() generator.eval()
with jsonlines.open(args.test_metadata, 'r') as reader: with jsonlines.open(args.test_metadata, 'r') as reader:
metadata = list(reader) metadata = list(reader)
test_dataset = DataTable( test_dataset = DataTable(
......
...@@ -11,8 +11,8 @@ ...@@ -11,8 +11,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# remain for chains
import argparse import argparse
import os
from pathlib import Path from pathlib import Path
import soundfile as sf import soundfile as sf
...@@ -31,8 +31,6 @@ def main(): ...@@ -31,8 +31,6 @@ def main():
type=str, type=str,
help="text to synthesize, a 'utt_id sentence' pair per line") help="text to synthesize, a 'utt_id sentence' pair per line")
parser.add_argument("--output-dir", type=str, help="output dir") parser.add_argument("--output-dir", type=str, help="output dir")
parser.add_argument(
"--enable-auto-log", action="store_true", help="use auto log")
parser.add_argument( parser.add_argument(
"--phones-dict", "--phones-dict",
type=str, type=str,
...@@ -64,23 +62,6 @@ def main(): ...@@ -64,23 +62,6 @@ def main():
pwg_config.enable_memory_optim() pwg_config.enable_memory_optim()
pwg_predictor = inference.create_predictor(pwg_config) pwg_predictor = inference.create_predictor(pwg_config)
if args.enable_auto_log:
import auto_log
os.makedirs("output", exist_ok=True)
pid = os.getpid()
logger = auto_log.AutoLogger(
model_name="speedyspeech",
model_precision='float32',
batch_size=1,
data_shape="dynamic",
save_path="./output/auto_log.log",
inference_config=speedyspeech_config,
pids=pid,
process_name=None,
gpu_ids=0,
time_keys=['preprocess_time', 'inference_time', 'postprocess_time'],
warmup=0)
output_dir = Path(args.output_dir) output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True) output_dir.mkdir(parents=True, exist_ok=True)
sentences = [] sentences = []
...@@ -93,9 +74,6 @@ def main(): ...@@ -93,9 +74,6 @@ def main():
sentences.append((utt_id, sentence)) sentences.append((utt_id, sentence))
for utt_id, sentence in sentences: for utt_id, sentence in sentences:
if args.enable_auto_log:
logger.times.start()
input_ids = frontend.get_input_ids( input_ids = frontend.get_input_ids(
sentence, merge_sentences=True, get_tone_ids=True) sentence, merge_sentences=True, get_tone_ids=True)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
...@@ -103,9 +81,6 @@ def main(): ...@@ -103,9 +81,6 @@ def main():
phones = phone_ids[0].numpy() phones = phone_ids[0].numpy()
tones = tone_ids[0].numpy() tones = tone_ids[0].numpy()
if args.enable_auto_log:
logger.times.stamp()
input_names = speedyspeech_predictor.get_input_names() input_names = speedyspeech_predictor.get_input_names()
phones_handle = speedyspeech_predictor.get_input_handle(input_names[0]) phones_handle = speedyspeech_predictor.get_input_handle(input_names[0])
tones_handle = speedyspeech_predictor.get_input_handle(input_names[1]) tones_handle = speedyspeech_predictor.get_input_handle(input_names[1])
...@@ -131,18 +106,10 @@ def main(): ...@@ -131,18 +106,10 @@ def main():
output_handle = pwg_predictor.get_output_handle(output_names[0]) output_handle = pwg_predictor.get_output_handle(output_names[0])
wav = output_data = output_handle.copy_to_cpu() wav = output_data = output_handle.copy_to_cpu()
if args.enable_auto_log:
logger.times.stamp()
sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000) sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)
if args.enable_auto_log:
logger.times.end(stamp=True)
print(f"{utt_id} done!") print(f"{utt_id} done!")
if args.enable_auto_log:
logger.report()
if __name__ == "__main__": if __name__ == "__main__":
main() main()
...@@ -11,6 +11,7 @@ ...@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# remain for chains
import argparse import argparse
import logging import logging
import os import os
......
...@@ -161,7 +161,7 @@ def train_sp(args, config): ...@@ -161,7 +161,7 @@ def train_sp(args, config):
def main(): def main():
# parse args and config and redirect to train_sp # parse args and config and redirect to train_sp
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Train a Speedyspeech model with sigle speaker dataset.") description="Train a Speedyspeech model with a single speaker dataset.")
parser.add_argument("--config", type=str, help="config file.") parser.add_argument("--config", type=str, help="config file.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
......
(This diff has been collapsed.)
...@@ -12,6 +12,7 @@ ...@@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from .fastspeech2 import * from .fastspeech2 import *
from .hifigan import *
from .melgan import * from .melgan import *
from .parallel_wavegan import * from .parallel_wavegan import *
from .speedyspeech import * from .speedyspeech import *
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .hifigan import *
from .hifigan_updater import *
(The diffs of four files have been collapsed.)
...@@ -60,7 +60,8 @@ class ResidualStack(nn.Layer): ...@@ -60,7 +60,8 @@ class ResidualStack(nn.Layer):
""" """
super().__init__() super().__init__()
# for compatibility # for compatibility
nonlinear_activation = nonlinear_activation.lower() if nonlinear_activation:
nonlinear_activation = nonlinear_activation.lower()
# define residual stack part # define residual stack part
if not use_causal_conv: if not use_causal_conv:
......
(This diff has been collapsed.)