diff --git a/examples/aishell3/tts3/README.md b/examples/aishell3/tts3/README.md index 4953b8c94c3b6f8c6e923c179d1174095c3f0a70..8d1c2aa9cd896f4741b850fb87b8919784b57a22 100644 --- a/examples/aishell3/tts3/README.md +++ b/examples/aishell3/tts3/README.md @@ -1,7 +1,7 @@ # FastSpeech2 with AISHELL-3 This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). -AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. +AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems. We use AISHELL-3 to train a multi-speaker fastspeech2 model here. ## Dataset @@ -17,7 +17,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA Result and Extract We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. @@ -32,7 +32,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -59,9 +59,9 @@ dump ├── raw └── speech_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. 
It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, a path of energy features, speaker, and id of each utterance. ### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. @@ -95,14 +95,14 @@ optional arguments: ``` 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. -6. `--speaker-dict`is the path of the speaker id map file when training a multi-speaker FastSpeech2. +6. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker FastSpeech2. ### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder. -Download pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. +Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. ```bash unzip pwg_aishell3_ckpt_0.5.zip ``` @@ -113,98 +113,115 @@ pwg_aishell3_ckpt_0.5 ├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan └── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan ``` -`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] + [--voice-cloning VOICE_CLONING] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--ngpu NGPU] + [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] -Synthesize with fastspeech2 & parallel wavegan. +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. 
- --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --speaker-dict SPEAKER_DICT - speaker id map file for multiple speaker model. - --test-metadata TEST_METADATA + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --voice-cloning VOICE_CLONING + whether training voice cloning model. + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --ngpu NGPU if ngpu == 0, use cpu. + --test_metadata TEST_METADATA test metadata. - --output-dir OUTPUT_DIR + --output_dir OUTPUT_DIR output dir. - --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose ``` -`./local/synthesize_e2e.sh` calls `${BIN_DIR}/multi_spk_synthesize_e2e.py`, which can synthesize waveform from text file. +`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: multi_spk_synthesize_e2e.py [-h] - [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] - [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] - [--speaker-dict SPEAKER_DICT] [--text TEXT] - [--output-dir OUTPUT_DIR] [--ngpu NGPU] - [--verbose VERBOSE] - -Synthesize with fastspeech2 & parallel wavegan. +usage: synthesize_e2e.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] + [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--lang LANG] + [--inference_dir INFERENCE_DIR] [--ngpu NGPU] + [--text TEXT] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. 
- --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --speaker-dict SPEAKER_DICT + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT speaker id map file. + --spk_id SPK_ID spk id for multi speaker acoustic model + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --lang LANG Choose model language. zh or en + --inference_dir INFERENCE_DIR + dir to save inference models + --ngpu NGPU if ngpu == 0, use cpu. --text TEXT text to synthesize, a 'utt_id sentence' pair per line. - --output-dir OUTPUT_DIR + --output_dir OUTPUT_DIR output dir. - --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. ``` -1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat`, `--phones-dict` and `--speaker-dict` are arguments for fastspeech2, which correspond to the 5 files in the fastspeech2 pretrained model. -2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. -3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. -4. `--text` is the text file, which contains sentences to synthesize. -5. `--output-dir` is the directory to save synthesized audio files. -6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. +1. `--am` is acoustic model type with the format {model_name}_{dataset} +2. `--am_config`, `--am_checkpoint`, `--am_stat`, `--phones_dict` `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model. +3. `--voc` is vocoder type with the format {model_name}_{dataset} +4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. +5. `--lang` is the model language, which can be `zh` or `en`. +6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. +7. `--text` is the text file, which contains sentences to synthesize. +8. `--output_dir` is the directory to save synthesized audio files. +9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model Pretrained FastSpeech2 model with no silence in the edge of audios. 
[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip) @@ -225,16 +242,20 @@ source path.sh FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/multi_spk_synthesize_e2e.py \ - --fastspeech2-config=fastspeech2_nosil_aishell3_ckpt_0.4/default.yaml \ - --fastspeech2-checkpoint=fastspeech2_nosil_aishell3_ckpt_0.4/snapshot_iter_96400.pdz \ - --fastspeech2-stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \ - --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \ - --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_aishell3 \ + --am_config=fastspeech2_nosil_aishell3_ckpt_0.4/default.yaml \ + --am_ckpt=fastspeech2_nosil_aishell3_ckpt_0.4/snapshot_iter_96400.pdz \ + --am_stat=fastspeech2_nosil_aishell3_ckpt_0.4/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --lang=zh \ --text=${BIN_DIR}/../sentences.txt \ - --output-dir=exp/default/test_e2e \ - --phones-dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \ - --speaker-dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt + --output_dir=exp/default/test_e2e \ + --phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \ + --speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \ + --spk_id=0 ``` diff --git a/examples/aishell3/tts3/conf/default.yaml b/examples/aishell3/tts3/conf/default.yaml index 0159c12f97a682e26447cf5aa044683beeba6409..3a57e902607f4de641dc36960ad928d590dc5fae 100644 --- a/examples/aishell3/tts3/conf/default.yaml +++ b/examples/aishell3/tts3/conf/default.yaml @@ -3,9 +3,9 @@ ########################################################### fs: 24000 # sr -n_fft: 2048 # FFT size. -n_shift: 300 # Hop size. -win_length: 1200 # Window length. +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. 
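The millisecond figures added to the comments above follow directly from the 24 kHz sampling rate; a quick check of the arithmetic (illustrative only, not something the recipe runs):
```bash
# n_shift and win_length are given in samples; dividing by fs yields the
# durations quoted in the updated comments (12.5 ms hop, 50 ms window).
python3 -c "fs, n_shift, win = 24000, 300, 1200; print(n_shift/fs*1000, 'ms hop,', win/fs*1000, 'ms window')"
# 12.5 ms hop, 50.0 ms window
```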
diff --git a/examples/aishell3/tts3/local/synthesize.sh b/examples/aishell3/tts3/local/synthesize.sh index db079410cc878a778ad8af35ff6a66ee694f2efa..b1fc96a2d67ee1002f5c99a58e52fe30c1e5b621 100755 --- a/examples/aishell3/tts3/local/synthesize.sh +++ b/examples/aishell3/tts3/local/synthesize.sh @@ -6,14 +6,16 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \ - --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --phones-dict=dump/phone_id_map.txt \ - --speaker-dict=dump/speaker_id_map.txt +python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_aishell3 \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt diff --git a/examples/aishell3/tts3/local/synthesize_e2e.sh b/examples/aishell3/tts3/local/synthesize_e2e.sh index 1a6e47e7114468b3562fb7419a694eab35d2adf0..d0d92585985ab2832da0aa0060e14a8ab4bdcfd2 100755 --- a/examples/aishell3/tts3/local/synthesize_e2e.sh +++ b/examples/aishell3/tts3/local/synthesize_e2e.sh @@ -6,14 +6,18 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/multi_spk_synthesize_e2e.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \ - --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ - --text=${BIN_DIR}/../sentences.txt \ - --output-dir=${train_output_path}/test_e2e \ - --phones-dict=dump/phone_id_map.txt \ - --speaker-dict=dump/speaker_id_map.txt +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_aishell3 \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=0 diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md index 93cb08bd06251a1323187e932d7fb9b896f6d18f..91d32619bfd2dc6328f0cb7df73e229ef1859e88 100644 --- a/examples/aishell3/vc0/README.md +++ b/examples/aishell3/vc0/README.md @@ -1,6 +1,6 @@ # Tacotron2 + AISHELL-3 Voice Cloning -This example contains code used to train a [Tacotron2 ](https://arxiv.org/abs/1712.05884) model with 
[AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows: -1. Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in Tacotron2, because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). +This example contains code used to train a [Tacotron2 ](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows: +1. Speaker Encoder: We use Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in Tacotron2 because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). 2. Synthesizer: We use the trained speaker encoder to generate speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of Tacotron2 which will be concated with encoder outputs. 3. Vocoder: We use WaveFlow as the neural Vocoder, refer to [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0). @@ -39,13 +39,13 @@ fi The computing time of utterance embedding can be x hours. #### Process Wav -There are silence in the edge of AISHELL-3's wavs, and the audio amplitude is very small, so, we need to remove the silence and normalize the audio. You can the silence remove method based on volume or energy, but the effect is not very good, We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence. +There is silence in the edge of AISHELL-3's wavs, and the audio amplitude is very small, so, we need to remove the silence and normalize the audio. You can the silence remove method based on volume or energy, but the effect is not very good, We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence. -We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so the lexicon we provided to MFA is pinyin rather than Chinese characters. And the prosody marks(`$` and `%`) need to be removed. You shoud preprocess the dataset into the format which MFA needs, the texts have the same name with wavs and have the suffix `.lab`. +We use Montreal Force Aligner 1.0. The label in aishell3 includes pinyin,so the lexicon we provided to MFA is pinyin rather than Chinese characters. And the prosody marks(`$` and `%`) need to be removed. You should preprocess the dataset into the format which MFA needs, the texts have the same name with wavs and have the suffix `.lab`. We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon. 
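A minimal sketch of the `.lab` convention described above. The utterance id, directory layout, and label text are made up for illustration; only the same-stem `.lab` naming and the removal of the `$` and `%` prosody marks come from the text:
```bash
# One .lab per wav, sharing the wav's stem; the label is pinyin with the
# prosody marks ($ and %) stripped before it is handed to MFA.
label='guang3 $zhou1% shi4'                        # hypothetical raw label
echo "$label" | tr -d '$%' > wav/SSB0005/SSB00050001.lab
ls wav/SSB0005/SSB00050001.wav                     # the matching audio file
```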
-You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. +You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ```bash @@ -83,9 +83,9 @@ fi CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} ``` -Our model remve stop token prediction in Tacotron2, because of the problem of extremely unbalanced proportion of positive and negative samples of stop token prediction, and it's very sensitive to the clip of audio silence. We use the last symbol from the highest point of attention to the encoder side as the termination condition. +Our model removes stop token prediction in Tacotron2, because of the problem of the extremely unbalanced proportion of positive and negative samples of stop token prediction, and it's very sensitive to the clip of audio silence. We use the last symbol from the highest point of attention to the encoder side as the termination condition. -In addition, in order to accelerate the convergence of the model, we add `guided attention loss` to induce the alignment between encoder and decoder to show diagonal lines faster. +In addition, to accelerate the convergence of the model, we add `guided attention loss` to induce the alignment between encoder and decoder to show diagonal lines faster. ### Voice Cloning ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${ge2e_params_path} ${tacotron2_params_path} ${waveflow_params_path} ${vc_input} ${vc_output} diff --git a/examples/aishell3/vc0/local/voice_cloning.sh b/examples/aishell3/vc0/local/voice_cloning.sh index ee96b9e0dbaca0336626936379e3af54fa30c0e3..3fe3de767dce9e3ca4f0b990eb2af9fa0e2aad7d 100755 --- a/examples/aishell3/vc0/local/voice_cloning.sh +++ b/examples/aishell3/vc0/local/voice_cloning.sh @@ -7,8 +7,8 @@ vc_input=$4 vc_output=$5 python3 ${BIN_DIR}/voice_cloning.py \ - --ge2e_params_path=${ge2e_params_path} \ - --tacotron2_params_path=${tacotron2_params_path} \ - --waveflow_params_path=${waveflow_params_path} \ - --input-dir=${vc_input} \ - --output-dir=${vc_output} \ No newline at end of file + --ge2e_params_path=${ge2e_params_path} \ + --tacotron2_params_path=${tacotron2_params_path} \ + --waveflow_params_path=${waveflow_params_path} \ + --input-dir=${vc_input} \ + --output-dir=${vc_output} \ No newline at end of file diff --git a/examples/aishell3/vc1/README.md b/examples/aishell3/vc1/README.md index 676678c999b9f0b08d963483f09fe22279da879b..d5745bc32cef8b54894b807c39767fb64246665b 100644 --- a/examples/aishell3/vc1/README.md +++ b/examples/aishell3/vc1/README.md @@ -1,7 +1,7 @@ # FastSpeech2 + AISHELL-3 Voice Cloning -This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows: -1. 
Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in `FastSpeech2`, because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). +This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows: +1. Speaker Encoder: We use Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in `FastSpeech2` because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). 2. Synthesizer: We use the trained speaker encoder to generate speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `FastSpeech2` which will be concated with encoder outputs. 3. Vocoder: We use [Parallel Wave GAN](http://arxiv.org/abs/1910.11480) as the neural Vocoder, refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1). @@ -18,7 +18,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA Result and Extract We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Pretrained GE2E Model We use pretrained GE2E model to generate speaker embedding for each sentence. @@ -39,7 +39,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -72,20 +72,20 @@ dump ``` The `embed` contains the generated speaker embedding for each sentence in AISHELL-3, which has the same file structure with wav files and the format is `.npy`. -The computing time of utterance embedding can be x hours. +The computing time of utterance embedding can be x hours. -The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. 
+The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance. -The preprocessing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ge2e/inference` step here. +The preprocessing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ge2e/inference` step here. ### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` -The training step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`. +The training step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`. ### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder. @@ -100,11 +100,11 @@ pwg_aishell3_ckpt_0.5 ├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan └── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan ``` -`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` -The synthesizing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/synthesize.py`. +The synthesizing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/../synthesize.py`. ### Voice Cloning Assume there are some reference audios in `./ref_audio` diff --git a/examples/aishell3/vc1/conf/default.yaml b/examples/aishell3/vc1/conf/default.yaml index 78c32525717eb7becde0f90271a26060c36e6644..557a5a0a1cd04b4cd4f5e2be57e71f42ba457078 100644 --- a/examples/aishell3/vc1/conf/default.yaml +++ b/examples/aishell3/vc1/conf/default.yaml @@ -3,9 +3,9 @@ ########################################################### fs: 24000 # sr -n_fft: 2048 # FFT size. -n_shift: 300 # Hop size. -win_length: 1200 # Window length. 
+n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. diff --git a/examples/aishell3/vc1/local/synthesize.sh b/examples/aishell3/vc1/local/synthesize.sh index 35478c784b76b76a5aeb8c57a9b3a7de6aa583c5..8c61e3f3e2c28f9a6160f5bfce41ef2fecede8df 100755 --- a/examples/aishell3/vc1/local/synthesize.sh +++ b/examples/aishell3/vc1/local/synthesize.sh @@ -6,14 +6,17 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \ - --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --phones-dict=dump/phone_id_map.txt \ - --voice-cloning=True +python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_aishell3 \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --voice-cloning=True diff --git a/examples/aishell3/vc1/local/voice_cloning.sh b/examples/aishell3/vc1/local/voice_cloning.sh index 55bdd761ef845d7dc8084ce0cb80947f7b050656..6a50826e843c6d7bb983bd1d4cfdbf2767886126 100755 --- a/examples/aishell3/vc1/local/voice_cloning.sh +++ b/examples/aishell3/vc1/local/voice_cloning.sh @@ -9,14 +9,14 @@ ref_audio_dir=$5 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/voice_cloning.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \ - --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ - --ge2e_params_path=${ge2e_params_path} \ - --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \ - --input-dir=${ref_audio_dir} \ - --output-dir=${train_output_path}/vc_syn \ - --phones-dict=dump/phone_id_map.txt + --fastspeech2-config=${config_path} \ + --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --fastspeech2-stat=dump/train/speech_stats.npy \ + --pwg-config=pwg_aishell3_ckpt_0.5/default.yaml \ + --pwg-checkpoint=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --pwg-stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --ge2e_params_path=${ge2e_params_path} \ + --text="凯莫瑞安联合体的经济崩溃迫在眉睫。" \ + --input-dir=${ref_audio_dir} \ + --output-dir=${train_output_path}/vc_syn \ + --phones-dict=dump/phone_id_map.txt diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md index de7e04a6fcc20aa313e42d808ca958637ebb561f..7da3946e3a6ff91a14850f36bca74cd8616738b3 100644 --- a/examples/aishell3/voc1/README.md +++ b/examples/aishell3/voc1/README.md @@ -1,7 +1,7 @@ # Parallel 
WaveGAN with AISHELL-3 This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). -AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. +AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that could be used to train multi-speaker Text-to-Speech (TTS) systems. ## Dataset ### Download and Extract Download AISHELL-3. @@ -15,7 +15,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA Result and Extract We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. @@ -53,9 +53,9 @@ dump └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. ### Model Training ```bash @@ -101,7 +101,7 @@ benchmark: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 
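For reference, the training options listed above are typically combined as in the sketch below; the values are illustrative, and the entry point is assumed to be the same `${BIN_DIR}/train.py` that `./local/train.sh` wraps in the other examples:
```bash
# Illustrative vocoder training call using the flags documented above; paths
# assume the default dump/ layout produced by the preprocessing stage.
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1
```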
### Synthesizing @@ -110,15 +110,19 @@ benchmark: CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] -Synthesize with parallel wavegan. +Synthesize with GANVocoder. optional arguments: -h, --help show this help message and exit - --config CONFIG parallel wavegan config file. + --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. --checkpoint CHECKPOINT snapshot to load. --test-metadata TEST_METADATA diff --git a/examples/aishell3/voc1/conf/default.yaml b/examples/aishell3/voc1/conf/default.yaml index eb6d350d41af6af148e64551ac53dad30c77c0d6..88968d6fccb03efc8073edb3b03a4765e41640bb 100644 --- a/examples/aishell3/voc1/conf/default.yaml +++ b/examples/aishell3/voc1/conf/default.yaml @@ -7,9 +7,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. (in samples) -n_shift: 300 # Hop size. (in samples) -win_length: 1200 # Window length. (in samples) +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. @@ -49,9 +49,9 @@ discriminator_params: bias: true # Whether to use bias parameter in conv. use_weight_norm: true # Whether to use weight norm. # If set to true, it will be applied to all of the conv layers. - nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. + nonlinear_activation: "leakyrelu" # Nonlinear function after each conv. nonlinear_activation_params: # Nonlinear function parameters - negative_slope: 0.2 # Alpha in LeakyReLU. + negative_slope: 0.2 # Alpha in leakyrelu. 
########################################################### # STFT LOSS SETTING # diff --git a/examples/aishell3/voc1/local/synthesize.sh b/examples/aishell3/voc1/local/synthesize.sh index d85d1b1d970c613630e6288c5b33d88f7a475016..145557b3d35a3e936c5c394ba9ca99bcbe6b3910 100755 --- a/examples/aishell3/voc1/local/synthesize.sh +++ b/examples/aishell3/voc1/local/synthesize.sh @@ -7,8 +7,8 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --config=${config_path} \ - --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --generator-type=pwgan + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=pwgan diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md index c9332967d685a302a68d3e2a5303421cc50f834b..2c7a917e9d2dd98613e76ffa0b59d90d1231f917 100644 --- a/examples/csmsc/tts2/README.md +++ b/examples/csmsc/tts2/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -18,8 +18,8 @@ Run the command below to 3. train the model. 4. synthesize wavs. - synthesize waveform from `metadata.jsonl`. - - synthesize waveform from text file. -5. inference using static model. + - synthesize waveform from a text file. +5. inference using the static model. ```bash ./run.sh ``` @@ -47,9 +47,9 @@ dump └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, tones, durations, path of spectrogram, and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. 
It is a table-like file that contains phones, tones, durations, the path of the spectrogram, and the id of each utterance. ### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. @@ -64,7 +64,7 @@ usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] [--use-relative-path USE_RELATIVE_PATH] [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT] -Train a Speedyspeech model with sigle speaker dataset. +Train a Speedyspeech model with a single speaker dataset. optional arguments: -h, --help show this help message and exit @@ -87,7 +87,7 @@ optional arguments: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. 6. `--tones-dict` is the path of the tone vocabulary file. @@ -105,107 +105,118 @@ pwg_baker_ckpt_0.4 ├── pwg_snapshot_iter_400000.pdz # model parameters of parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan ``` -`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` -```text -usage: synthesize.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG] - [--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT] - [--speedyspeech-stat SPEEDYSPEECH_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--inference-dir INFERENCE_DIR] [--ngpu NGPU] - [--verbose VERBOSE] - -Synthesize with speedyspeech & parallel wavegan. +``text +usage: synthesize.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] + [--voice-cloning VOICE_CLONING] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--ngpu NGPU] + [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --speedyspeech-config SPEEDYSPEECH_CONFIG - config file for speedyspeech. - --speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT - speedyspeech checkpoint to load. - --speedyspeech-stat SPEEDYSPEECH_STAT - mean and standard deviation used to normalize - spectrogram when training speedyspeech. - --pwg-config PWG_CONFIG - config file for parallelwavegan. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. 
- --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training speedyspeech. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --tones-dict TONES_DICT + --tones_dict TONES_DICT tone vocabulary file. - --test-metadata TEST_METADATA - test metadata - --output-dir OUTPUT_DIR - output dir - --inference-dir INFERENCE_DIR - dir to save inference models + --speaker_dict SPEAKER_DICT + speaker id map file. + --voice-cloning VOICE_CLONING + whether training voice cloning model. + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose + --test_metadata TEST_METADATA + test metadata. + --output_dir OUTPUT_DIR + output dir. ``` -`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e.py`, which can synthesize waveform from text file. +`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize_e2e.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG] - [--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT] - [--speedyspeech-stat SPEEDYSPEECH_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] - [--pwg-stat PWG_STAT] [--text TEXT] - [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT] - [--output-dir OUTPUT_DIR] - [--inference-dir INFERENCE_DIR] [--verbose VERBOSE] - [--ngpu NGPU] - -Synthesize with speedyspeech & parallel wavegan. +usage: synthesize_e2e.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] + [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--lang LANG] + [--inference_dir INFERENCE_DIR] [--ngpu NGPU] + [--text TEXT] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --speedyspeech-config SPEEDYSPEECH_CONFIG - config file for speedyspeech. - --speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT - speedyspeech checkpoint to load. - --speedyspeech-stat SPEEDYSPEECH_STAT - mean and standard deviation used to normalize - spectrogram when training speedyspeech. - --pwg-config PWG_CONFIG - config file for parallelwavegan. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan checkpoint to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training speedyspeech. 
- --text TEXT text to synthesize, a 'utt_id sentence' pair per line - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --tones-dict TONES_DICT + --tones_dict TONES_DICT tone vocabulary file. - --output-dir OUTPUT_DIR - output dir - --inference-dir INFERENCE_DIR + --speaker_dict SPEAKER_DICT + speaker id map file. + --spk_id SPK_ID spk id for multi speaker acoustic model + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --lang LANG Choose model language. zh or en + --inference_dir INFERENCE_DIR dir to save inference models - --verbose VERBOSE verbose --ngpu NGPU if ngpu == 0, use cpu. + --text TEXT text to synthesize, a 'utt_id sentence' pair per line. + --output_dir OUTPUT_DIR + output dir. ``` -1. `--speedyspeech-config`, `--speedyspeech-checkpoint`, `--speedyspeech-stat` are arguments for speedyspeech, which correspond to the 3 files in the speedyspeech pretrained model. -2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. -3. `--text` is the text file, which contains sentences to synthesize. -4. `--output-dir` is the directory to save synthesized audio files. -5. `--inference-dir` is the directory to save exported model, which can be used with paddle infernece. -6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -7. `--phones-dict` is the path of the phone vocabulary file. -8. `--tones-dict` is the path of the tone vocabulary file. +1. `--am` is acoustic model type with the format {model_name}_{dataset} +2. `--am_config`, `--am_checkpoint`, `--am_stat`, `--phones_dict` and `--tones_dict` are arguments for acoustic model, which correspond to the 5 files in the speedyspeech pretrained model. +3. `--voc` is vocoder type with the format {model_name}_{dataset} +4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. +5. `--lang` is the model language, which can be `zh` or `en`. +6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. +7. `--text` is the text file, which contains sentences to synthesize. +8. `--output_dir` is the directory to save synthesized audio files. +9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ### Inferencing -After Synthesize, we will get static models of speedyspeech and pwgan in `${train_output_path}/inference`. +After synthesizing, we will get static models of speedyspeech and pwgan in `${train_output_path}/inference`. `./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for speedyspeech + pwgan synthesize. 
```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} @@ -214,7 +225,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} ## Pretrained Model Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip). -Static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip). +The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip). Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss :-------------:| :------------:| :-----: | :-----: | :--------:|:--------: @@ -235,16 +246,19 @@ source path.sh FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize_e2e.py \ - --speedyspeech-config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \ - --speedyspeech-checkpoint=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \ - --speedyspeech-stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \ - --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ - --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=speedyspeech_csmsc \ + --am_config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \ + --am_ckpt=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \ + --am_stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \ + --voc=pwgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --lang=zh \ --text=${BIN_DIR}/../sentences.txt \ - --output-dir=exp/default/test_e2e \ - --inference-dir=exp/default/inference \ - --phones-dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \ - --tones-dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt + --output_dir=exp/default/test_e2e \ + --inference_dir=exp/default/inference \ + --phones_dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \ + --tones_dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt ``` diff --git a/examples/csmsc/tts2/conf/default.yaml b/examples/csmsc/tts2/conf/default.yaml index e93b50826ce940155469ff634149b8f6e242fb2a..a3366b8f92660386009847d748ba9072b99bb4bc 100644 --- a/examples/csmsc/tts2/conf/default.yaml +++ b/examples/csmsc/tts2/conf/default.yaml @@ -2,9 +2,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. -n_shift: 300 # Hop size. -win_length: 1200 # Window length. +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. 
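As a side note on the `--text` argument used in the `synthesize_e2e` examples above: it expects one `utt_id sentence` pair per line, and the bundled `${BIN_DIR}/../sentences.txt` already follows this format. A made-up example file:
```bash
# Each line is an "utt_id sentence" pair, as described in the usage text above.
cat > my_sentences.txt << 'EOF'
001 凯莫瑞安联合体的经济崩溃迫在眉睫。
002 欢迎使用飞桨语音合成系统。
EOF
```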
diff --git a/examples/csmsc/tts2/local/inference.sh b/examples/csmsc/tts2/local/inference.sh index 37e2e55c7b2b054ce02044473d62630f8d315ec1..d78c3eb3273362fcbae7288005e7ce848e059379 100755 --- a/examples/csmsc/tts2/local/inference.sh +++ b/examples/csmsc/tts2/local/inference.sh @@ -2,9 +2,55 @@ train_output_path=$1 -python3 ${BIN_DIR}/inference.py \ - --inference-dir=${train_output_path}/inference \ - --text=${BIN_DIR}/../sentences.txt \ - --output-dir=${train_output_path}/pd_infer_out \ - --phones-dict=dump/phone_id_map.txt \ - --tones-dict=dump/tone_id_map.txt +stage=0 +stop_stage=0 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=speedyspeech_csmsc \ + --voc=pwgan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi + +# for more GAN Vocoders +# multi band melgan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=speedyspeech_csmsc \ + --voc=mb_melgan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi + +# style melgan +# style melgan's Dygraph to Static Graph is not ready now +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=speedyspeech_csmsc \ + --voc=style_melgan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=speedyspeech_csmsc \ + --voc=hifigan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi diff --git a/examples/csmsc/tts2/local/synthesize.sh b/examples/csmsc/tts2/local/synthesize.sh index 8be02dfbfb2cca5a6fa693c65bd8fa5de0aa60c4..cedc9717d774666f94f43a8b91c8cafd6e2ad6c3 100755 --- a/examples/csmsc/tts2/local/synthesize.sh +++ b/examples/csmsc/tts2/local/synthesize.sh @@ -5,15 +5,16 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize.py \ - --speedyspeech-config=${config_path} \ - --speedyspeech-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --speedyspeech-stat=dump/train/feats_stats.npy \ - --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ - --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --inference-dir=${train_output_path}/inference \ - --phones-dict=dump/phone_id_map.txt \ - --tones-dict=dump/tone_id_map.txt \ No newline at end of file +python3 ${BIN_DIR}/../synthesize.py \ + --am=speedyspeech_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/feats_stats.npy \ + --voc=pwgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + 
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt \ No newline at end of file diff --git a/examples/csmsc/tts2/local/synthesize_e2e.sh b/examples/csmsc/tts2/local/synthesize_e2e.sh index 3cbc7936f7d8c00a0674e4a98876c17993ce643c..a458bd5ff7cfe9a01162fa9f332edf639a79b0f5 100755 --- a/examples/csmsc/tts2/local/synthesize_e2e.sh +++ b/examples/csmsc/tts2/local/synthesize_e2e.sh @@ -4,17 +4,91 @@ config_path=$1 train_output_path=$2 ckpt_name=$3 -FLAGS_allocator_strategy=naive_best_fit \ -FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize_e2e.py \ - --speedyspeech-config=${config_path} \ - --speedyspeech-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --speedyspeech-stat=dump/train/feats_stats.npy \ - --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ - --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ - --text=${BIN_DIR}/../sentences.txt \ - --output-dir=${train_output_path}/test_e2e \ - --inference-dir=${train_output_path}/inference \ - --phones-dict=dump/phone_id_map.txt \ - --tones-dict=dump/tone_id_map.txt +stage=0 +stop_stage=0 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=speedyspeech_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/feats_stats.npy \ + --voc=pwgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi + +# for more GAN Vocoders +# multi band melgan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=speedyspeech_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/feats_stats.npy \ + --voc=mb_melgan_csmsc \ + --voc_config=mb_melgan_baker_finetune_ckpt_0.5/finetune.yaml \ + --voc_ckpt=mb_melgan_baker_finetune_ckpt_0.5/snapshot_iter_2000000.pdz\ + --voc_stat=mb_melgan_baker_finetune_ckpt_0.5/feats_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi + +# the pretrained models haven't release now +# style melgan +# style melgan's Dygraph to Static Graph is not ready now +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=speedyspeech_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/feats_stats.npy \ + --voc=style_melgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + 
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt + # --inference_dir=${train_output_path}/inference +fi + +# hifigan +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=speedyspeech_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/feats_stats.npy \ + --voc=hifigan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt +fi diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md index 488ede2ec132f43afbaf4d2ccb2c01b4c11db76c..570dd28b800bdf3f8aa2d6d3e8c271639ea8b3db 100644 --- a/examples/csmsc/tts3/README.md +++ b/examples/csmsc/tts3/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -18,12 +18,12 @@ Run the command below to 3. train the model. 4. synthesize wavs. - synthesize waveform from `metadata.jsonl`. - - synthesize waveform from text file. -5. inference using static model. + - synthesize waveform from a text file. +5. inference using the static model. ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -50,9 +50,9 @@ dump ├── raw └── speech_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. 
The raw folder contains speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and the id of each utterance. ### Model Training ```bash @@ -86,7 +86,7 @@ optional arguments: ``` 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. @@ -103,100 +103,118 @@ pwg_baker_ckpt_0.4 ├── pwg_snapshot_iter_400000.pdz # model parameters of parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan ``` -`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] + [--voice-cloning VOICE_CLONING] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--ngpu NGPU] + [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] -Synthesize with fastspeech2 & parallel wavegan. +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. 
- --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --speaker-dict SPEAKER_DICT - speaker id map file for multiple speaker model. - --test-metadata TEST_METADATA + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --voice-cloning VOICE_CLONING + whether training voice cloning model. + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --ngpu NGPU if ngpu == 0, use cpu. + --test_metadata TEST_METADATA test metadata. - --output-dir OUTPUT_DIR + --output_dir OUTPUT_DIR output dir. - --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. ``` -`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e.py`, which can synthesize waveform from text file. +`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize_e2e.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] - [--pwg-stat PWG_STAT] [--phones-dict PHONES_DICT] - [--text TEXT] [--output-dir OUTPUT_DIR] - [--inference-dir INFERENCE_DIR] [--ngpu NGPU] - [--verbose VERBOSE] - -Synthesize with fastspeech2 & parallel wavegan. +usage: synthesize_e2e.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] + [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--lang LANG] + [--inference_dir INFERENCE_DIR] [--ngpu NGPU] + [--text TEXT] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. 
- --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --text TEXT text to synthesize, a 'utt_id sentence' pair per line. - --output-dir OUTPUT_DIR - output dir. - --inference-dir INFERENCE_DIR + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --spk_id SPK_ID spk id for multi speaker acoustic model + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --lang LANG Choose model language. zh or en + --inference_dir INFERENCE_DIR dir to save inference models --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. + --text TEXT text to synthesize, a 'utt_id sentence' pair per line. + --output_dir OUTPUT_DIR + output dir. ``` - -1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat` and `--phones-dict` are arguments for fastspeech2, which correspond to the 4 files in the fastspeech2 pretrained model. -2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. -3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. -4. `--text` is the text file, which contains sentences to synthesize. -5. `--output-dir` is the directory to save synthesized audio files. -6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. +1. `--am` is acoustic model type with the format {model_name}_{dataset} +2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model. +3. `--voc` is vocoder type with the format {model_name}_{dataset} +4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. +5. `--lang` is the model language, which can be `zh` or `en`. +6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. +7. `--text` is the text file, which contains sentences to synthesize. +8. `--output_dir` is the directory to save synthesized audio files. +9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ### Inferencing -After Synthesize, we will get static models of fastspeech2 and pwgan in `${train_output_path}/inference`. +After synthesizing, we will get static models of fastspeech2 and pwgan in `${train_output_path}/inference`. `./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for fastspeech2 + pwgan synthesize. 
```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} @@ -207,7 +225,7 @@ Pretrained FastSpeech2 model with no silence in the edge of audios: - [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) - [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip) -Static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip). +The static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip). Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: @@ -228,15 +246,18 @@ source path.sh FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize_e2e.py \ - --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \ - --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \ - --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \ - --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ - --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_csmsc \ + --am_config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \ + --am_ckpt=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \ + --am_stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \ + --voc=pwgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --lang=zh \ --text=${BIN_DIR}/../sentences.txt \ - --output-dir=exp/default/test_e2e \ - --inference-dir=exp/default/inference \ - --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt + --output_dir=exp/default/test_e2e \ + --inference_dir=exp/default/inference \ + --phones_dict=dump/phone_id_map.txt ``` diff --git a/examples/csmsc/tts3/conf/conformer.yaml b/examples/csmsc/tts3/conf/conformer.yaml index a34ef318d7589014a85473cd6570b7fa18532017..252f634d8deff7a14ebf42e284027db78642cc8a 100644 --- a/examples/csmsc/tts3/conf/conformer.yaml +++ b/examples/csmsc/tts3/conf/conformer.yaml @@ -3,9 +3,9 @@ ########################################################### fs: 24000 # sr -n_fft: 2048 # FFT size. -n_shift: 300 # Hop size. -win_length: 1200 # Window length. +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. diff --git a/examples/csmsc/tts3/conf/default.yaml b/examples/csmsc/tts3/conf/default.yaml index 55dca6d859357dc716555140d40ae6b06a6dc21b..1f723d67cd6051ee885cd2f909483d0a2aed6438 100644 --- a/examples/csmsc/tts3/conf/default.yaml +++ b/examples/csmsc/tts3/conf/default.yaml @@ -3,9 +3,9 @@ ########################################################### fs: 24000 # sr -n_fft: 2048 # FFT size. -n_shift: 300 # Hop size. -win_length: 1200 # Window length. +n_fft: 2048 # FFT size (samples). 
+n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. diff --git a/examples/csmsc/tts3/local/inference.sh b/examples/csmsc/tts3/local/inference.sh index cab72547c76a52e7b610eea6ad340b16a72216fa..7c58980cdd1e9602743b13dccbaac09b0e3f443b 100755 --- a/examples/csmsc/tts3/local/inference.sh +++ b/examples/csmsc/tts3/local/inference.sh @@ -2,8 +2,50 @@ train_output_path=$1 -python3 ${BIN_DIR}/inference.py \ - --inference-dir=${train_output_path}/inference \ - --text=${BIN_DIR}/../sentences.txt \ - --output-dir=${train_output_path}/pd_infer_out \ - --phones-dict=dump/phone_id_map.txt +stage=0 +stop_stage=0 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_csmsc \ + --voc=pwgan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt +fi + +# for more GAN Vocoders +# multi band melgan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_csmsc \ + --voc=mb_melgan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt +fi + +# style melgan +# style melgan's Dygraph to Static Graph is not ready now +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_csmsc \ + --voc=style_melgan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_csmsc \ + --voc=hifigan_csmsc \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt +fi \ No newline at end of file diff --git a/examples/csmsc/tts3/local/synthesize.sh b/examples/csmsc/tts3/local/synthesize.sh index e525fc162026a12ced5af4cc2b28e5a2b911a487..1976742660c42a46e4dd8ceef61e629286c08b18 100755 --- a/examples/csmsc/tts3/local/synthesize.sh +++ b/examples/csmsc/tts3/local/synthesize.sh @@ -6,13 +6,15 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ - --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --phones-dict=dump/phone_id_map.txt +python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + 
--test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt diff --git a/examples/csmsc/tts3/local/synthesize_e2e.sh b/examples/csmsc/tts3/local/synthesize_e2e.sh index cc27ffb6f041686b84a607c85992c6223b4464eb..891ed041b3129dab88d1fe38b424ce099e0fdc1f 100755 --- a/examples/csmsc/tts3/local/synthesize_e2e.sh +++ b/examples/csmsc/tts3/local/synthesize_e2e.sh @@ -4,16 +4,88 @@ config_path=$1 train_output_path=$2 ckpt_name=$3 -FLAGS_allocator_strategy=naive_best_fit \ -FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize_e2e.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \ - --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ - --text=${BIN_DIR}/../sentences.txt \ - --output-dir=${train_output_path}/test_e2e \ - --inference-dir=${train_output_path}/inference \ - --phones-dict=dump/phone_id_map.txt +stage=0 +stop_stage=0 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_csmsc \ + --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \ + --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt +fi + +# for more GAN Vocoders +# multi band melgan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=mb_melgan_csmsc \ + --voc_config=mb_melgan_baker_finetune_ckpt_0.5/finetune.yaml \ + --voc_ckpt=mb_melgan_baker_finetune_ckpt_0.5/snapshot_iter_2000000.pdz\ + --voc_stat=mb_melgan_baker_finetune_ckpt_0.5/feats_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt +fi + +# the pretrained models haven't release now +# style melgan +# style melgan's Dygraph to Static Graph is not ready now +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=style_melgan_csmsc \ + --voc_config=style_melgan_test/default.yaml \ + --voc_ckpt=style_melgan_test/snapshot_iter_935000.pdz \ + --voc_stat=style_melgan_test/feats_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt + # --inference_dir=${train_output_path}/inference +fi 
+ +# hifigan +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + echo "in hifigan syn_e2e" + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_csmsc \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=hifigan_csmsc \ + --voc_config=hifigan_test/default.yaml \ + --voc_ckpt=hifigan_test/snapshot_iter_1600000.pdz \ + --voc_stat=hifigan_test/feats_stats.npy \ + --lang=zh \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md index b13d5896474589b8fe30ff51834ea2da9769cd2a..19a9c722ac45c90ee91f4e29d447b9183dc3266f 100644 --- a/examples/csmsc/voc1/README.md +++ b/examples/csmsc/voc1/README.md @@ -2,11 +2,11 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). ## Dataset ### Download and Extract -Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. +Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`. ### Get MFA Result and Extract -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -20,7 +20,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -43,9 +43,9 @@ dump ├── raw └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. 
+The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. ### Model Training ```bash @@ -91,7 +91,7 @@ benchmark: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ### Synthesizing @@ -100,15 +100,19 @@ benchmark: CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] -Synthesize with parallel wavegan. +Synthesize with GANVocoder. optional arguments: -h, --help show this help message and exit - --config CONFIG parallel wavegan config file. + --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. --checkpoint CHECKPOINT snapshot to load. --test-metadata TEST_METADATA @@ -126,9 +130,9 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Models -Pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip). +The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip). -Static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip). +The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip). 
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/csmsc/voc1/conf/default.yaml b/examples/csmsc/voc1/conf/default.yaml index 21bdd04050295af6a071f5b373bf2a0e93fe16c6..9ea81b8d3c2e225041d95cd0196ba1d9b606142b 100644 --- a/examples/csmsc/voc1/conf/default.yaml +++ b/examples/csmsc/voc1/conf/default.yaml @@ -7,9 +7,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. (in samples) -n_shift: 300 # Hop size. (in samples) -win_length: 1200 # Window length. (in samples) +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. @@ -56,9 +56,9 @@ discriminator_params: bias: true # Whether to use bias parameter in conv. use_weight_norm: true # Whether to use weight norm. # If set to true, it will be applied to all of the conv layers. - nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. + nonlinear_activation: "leakyrelu" # Nonlinear function after each conv. nonlinear_activation_params: # Nonlinear function parameters - negative_slope: 0.2 # Alpha in LeakyReLU. + negative_slope: 0.2 # Alpha in leakyrelu. ########################################################### # STFT LOSS SETTING # diff --git a/examples/csmsc/voc1/local/synthesize.sh b/examples/csmsc/voc1/local/synthesize.sh index d85d1b1d970c613630e6288c5b33d88f7a475016..145557b3d35a3e936c5c394ba9ca99bcbe6b3910 100755 --- a/examples/csmsc/voc1/local/synthesize.sh +++ b/examples/csmsc/voc1/local/synthesize.sh @@ -7,8 +7,8 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --config=${config_path} \ - --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --generator-type=pwgan + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=pwgan diff --git a/examples/csmsc/voc3/README.md b/examples/csmsc/voc3/README.md index 99cef233f5d6792c9cdaeb28c1676214903693c2..e4f6be4e8260972ed334d2356d845213de95d0a1 100644 --- a/examples/csmsc/voc3/README.md +++ b/examples/csmsc/voc3/README.md @@ -2,11 +2,11 @@ This example contains code used to train a [Multi Band MelGAN](https://arxiv.org/abs/2005.05106) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). ## Dataset ### Download and Extract -Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. +Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`. ### Get MFA Result and Extract -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. 
-You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -20,7 +20,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -43,9 +43,9 @@ dump ├── raw └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. ### Model Training ```bash @@ -76,7 +76,7 @@ optional arguments: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 
### Synthesizing @@ -85,15 +85,19 @@ optional arguments: CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] -Synthesize with multi band melgan. +Synthesize with GANVocoder. optional arguments: -h, --help show this help message and exit - --config CONFIG multi band melgan config file. + --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. --checkpoint CHECKPOINT snapshot to load. --test-metadata TEST_METADATA @@ -111,22 +115,22 @@ optional arguments: 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Fine-tuning -Since there are no `noise` in the input of Multi Band MelGAN, the audio quality is not so good (see [espnet issue](https://github.com/espnet/espnet/issues/3536#issuecomment-916035415)), we refer to the method proposed in [HiFiGAN](https://arxiv.org/abs/2010.05646), finetune Multi Band MelGAN with the predicted mel-spectrogram from `FastSpeech2`. +Since there is no `noise` in the input of Multi-Band MelGAN, the audio quality is not so good (see [espnet issue](https://github.com/espnet/espnet/issues/3536#issuecomment-916035415)), we refer to the method proposed in [HiFiGAN](https://arxiv.org/abs/2010.05646), finetune Multi-Band MelGAN with the predicted mel-spectrogram from `FastSpeech2`. The length of mel-spectrograms should align with the length of wavs, so we should generate mels using ground truth alignment. -But since we are fine-tuning, we should use the statistics computed during training step. +But since we are fine-tuning, we should use the statistics computed during the training step. -You should first download pretrained `FastSpeech2` model from [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) and `unzip` it. +You should first download pretrained `FastSpeech2` model from [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) and `unzip` it. -Assume the path to the dump-dir of training step is `dump`. -Assume the path to the duration result of CSMSC is `durations.txt` (generated during training step's preprocessing). +Assume the path to the dump-dir of training step is `dump`. +Assume the path to the duration result of CSMSC is `durations.txt` (generated during the training step's preprocessing). Assume the path to the pretrained `FastSpeech2` model is `fastspeech2_nosil_baker_ckpt_0.4`. \ The `finetune.sh` can 1. **source path**. 2. generate ground truth alignment mels. -3. link `*_wave.npy` from `dump` to `dump_finetune` (because we only use new mels, the wavs are the ones used during train step) . +3. link `*_wave.npy` from `dump` to `dump_finetune` (because we only use new mels, the wavs are the ones used during the training step). 4. copy features' stats from `dump` to `dump_finetune`. 5. normalize the ground truth alignment mels. 6. finetune the model. 
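For orientation, steps 3 and 4 of the list above boil down to reusing artifacts from the original training dump. The sketch below only illustrates the idea, assuming the default `dump/` and `dump_finetune/` layout of this example (with waveforms stored as `*_wave.npy` under each `raw/` subfolder); the authoritative commands are the ones in `finetune.sh`.
```bash
# Sketch of steps 3-4: reuse the original waveforms and feature statistics so
# that only the mel features are regenerated from ground-truth alignments.
# dump/ and dump_finetune/ are the default paths assumed in this example.
for sub in train dev test; do
    mkdir -p "dump_finetune/${sub}/raw"
    # step 3: link *_wave.npy from the training dump instead of re-extracting them
    for f in dump/${sub}/raw/*_wave.npy; do
        [ -e "$f" ] || continue
        ln -sf "$(realpath "$f")" "dump_finetune/${sub}/raw/"
    done
done
# step 4: copy the statistics computed during the original training step
cp dump/train/feats_stats.npy dump_finetune/train/feats_stats.npy
```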
@@ -137,9 +141,9 @@ exp/finetune/checkpoints ├── records.jsonl └── snapshot_iter_1000000.pdz ``` -The content of `records.jsonl` should be as follows (change `"path"` to your own ckpt path): +The content of `records.jsonl` should be as follows (change `"path"` to your ckpt path): ``` -{"time": "2021-11-21 15:11:20.337311", "path": "~/PaddleSpeech/examples/csmsc/voc3/exp/finetune/checkpoints/snapshot_iter_1000000.pdz", "iteration": 1000000}↩ +{"time": "2021-11-21 15:11:20.337311", "path": "~/PaddleSpeech/examples/csmsc/voc3/exp/finetune/checkpoints/snapshot_iter_1000000.pdz", "iteration": 1000000} ``` Run the command below ```bash @@ -151,11 +155,11 @@ TODO: The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set). ## Pretrained Models -Pretrained model can be downloaded here [mb_melgan_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_ckpt_0.5.zip). +The pretrained model can be downloaded here [mb_melgan_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_ckpt_0.5.zip). -Finetuned model can ben downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip). +The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip). -Static model can be downloaded here [mb_melgan_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_static_0.5.zip) +The static model can be downloaded here [mb_melgan_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_static_0.5.zip) Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss :-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------: diff --git a/examples/csmsc/voc3/conf/default.yaml b/examples/csmsc/voc3/conf/default.yaml index e66d98a63dda8b6e1bc758104bc65788f12ba02a..27e97664aa2eaf16a1e0b5c154dcf028d58cc4ee 100644 --- a/examples/csmsc/voc3/conf/default.yaml +++ b/examples/csmsc/voc3/conf/default.yaml @@ -12,9 +12,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. (in samples) -n_shift: 300 # Hop size. (in samples) -win_length: 1200 # Window length. (in samples) +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. @@ -54,7 +54,7 @@ discriminator_params: channels: 16 # Number of channels of the initial conv layer. max_downsample_channels: 512 # Maximum number of channels of downsampling layers. downsample_scales: [4, 4, 4] # List of downsampling scales. - nonlinear_activation: "LeakyReLU" # Nonlinear activation function. + nonlinear_activation: "leakyrelu" # Nonlinear activation function. nonlinear_activation_params: # Parameters of nonlinear activation function. negative_slope: 0.2 use_weight_norm: True # Whether to use weight norm. 
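Since the discriminator activation name is switched from `"LeakyReLU"` to `"leakyrelu"` in several vocoder configs, a quick check like the one below can confirm none of the CSMSC vocoder configs was missed. This is only a sketch run from the repository root, assuming the `examples/csmsc/voc*/conf/*.yaml` layout used in this repo.
```bash
# List the activation setting in every CSMSC vocoder config to confirm the
# lowercase spelling is used consistently (run from the repository root).
grep -n "nonlinear_activation:" examples/csmsc/voc*/conf/*.yaml
```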
diff --git a/examples/csmsc/voc3/conf/finetune.yaml b/examples/csmsc/voc3/conf/finetune.yaml index 8610c526e08fbd7d5b8cf4feec7613bc5d69177e..a3b1d8b113f8c5647d5942b0c270dede1b90593b 100644 --- a/examples/csmsc/voc3/conf/finetune.yaml +++ b/examples/csmsc/voc3/conf/finetune.yaml @@ -12,9 +12,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. (in samples) -n_shift: 300 # Hop size. (in samples) -win_length: 1200 # Window length. (in samples) +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. @@ -54,7 +54,7 @@ discriminator_params: channels: 16 # Number of channels of the initial conv layer. max_downsample_channels: 512 # Maximum number of channels of downsampling layers. downsample_scales: [4, 4, 4] # List of downsampling scales. - nonlinear_activation: "LeakyReLU" # Nonlinear activation function. + nonlinear_activation: "leakyrelu" # Nonlinear activation function. nonlinear_activation_params: # Parameters of nonlinear activation function. negative_slope: 0.2 use_weight_norm: True # Whether to use weight norm. diff --git a/examples/csmsc/voc3/local/synthesize.sh b/examples/csmsc/voc3/local/synthesize.sh index 22d879fa33646cf311fc8580f514a58d616d82a8..07e791a586b8ed217e8f06c0c379bc97a111c522 100755 --- a/examples/csmsc/voc3/local/synthesize.sh +++ b/examples/csmsc/voc3/local/synthesize.sh @@ -7,8 +7,8 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --config=${config_path} \ - --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --generator-type=mb_melgan + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=mb_melgan diff --git a/examples/csmsc/voc4/README.md b/examples/csmsc/voc4/README.md index 86030e39d69deba073d70945f28e0eab7ac60b34..57d88e0fc59a09f163bcdfb208eb6d561dd53795 100644 --- a/examples/csmsc/voc4/README.md +++ b/examples/csmsc/voc4/README.md @@ -2,11 +2,11 @@ This example contains code used to train a [Style MelGAN](https://arxiv.org/abs/2011.01557) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). ## Dataset ### Download and Extract -Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. +Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`. ### Get MFA Result and Extract -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. 
+We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -20,7 +20,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -43,9 +43,9 @@ dump ├── raw └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. ### Model Training ```bash @@ -76,7 +76,7 @@ optional arguments: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ### Synthesizing @@ -85,15 +85,19 @@ optional arguments: CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] -Synthesize with multi band melgan. +Synthesize with GANVocoder. optional arguments: -h, --help show this help message and exit - --config CONFIG multi band melgan config file. 
+ --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. --checkpoint CHECKPOINT snapshot to load. --test-metadata TEST_METADATA @@ -104,7 +108,7 @@ optional arguments: --verbose VERBOSE verbose. ``` -1. `--config` multi band melgan config file. You should use the same config with which the model is trained. +1. `--config` style melgan config file. You should use the same config with which the model is trained. 2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory. 3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory. 4. `--output-dir` is the directory to save the synthesized audio files. diff --git a/examples/csmsc/voc4/conf/default.yaml b/examples/csmsc/voc4/conf/default.yaml index cad4cf9bfcce4b11c8ce79afe5e89dca69005fa8..6f7d0f2b326ddd244b8a28f7a9e577964b3b24da 100644 --- a/examples/csmsc/voc4/conf/default.yaml +++ b/examples/csmsc/voc4/conf/default.yaml @@ -9,9 +9,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. (in samples) -n_shift: 300 # Hop size. (in samples) -win_length: 1200 # Window length. (in samples) +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. diff --git a/examples/csmsc/voc4/local/synthesize.sh b/examples/csmsc/voc4/local/synthesize.sh index 527e5f839d0300b93c4361e197a0492b205704c5..85d9a0c82e0a52f38394aa78efbcaafaf937bfa9 100755 --- a/examples/csmsc/voc4/local/synthesize.sh +++ b/examples/csmsc/voc4/local/synthesize.sh @@ -7,8 +7,8 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --config=${config_path} \ - --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --generator-type=style_melgan + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=style_melgan diff --git a/examples/csmsc/voc5/README.md b/examples/csmsc/voc5/README.md new file mode 100644 index 0000000000000000000000000000000000000000..2ced9f779a31a5086d5fc37ef813523040a953f4 --- /dev/null +++ b/examples/csmsc/voc5/README.md @@ -0,0 +1,117 @@ +# HiFiGAN with CSMSC +This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). +## Dataset +### Download and Extract +Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`. + +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio. 
+You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. + +## Get Started +Assume the path to the dataset is `~/datasets/BZNSYP`. +Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`. +Run the command below to +1. **source path**. +2. preprocess the dataset. +3. train the model. +4. synthesize wavs. + - synthesize waveform from `metadata.jsonl`. +```bash +./run.sh +``` +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing +```bash +./local/preprocess.sh ${conf_path} +``` +When it is done. A `dump` folder is created in the current directory. The structure of the dump folder is listed below. + +```text +dump +├── dev +│ ├── norm +│ └── raw +├── test +│ ├── norm +│ └── raw +└── train + ├── norm + ├── raw + └── feats_stats.npy +``` +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. + +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. + +### Model Training +```bash +CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} +``` +`./local/train.sh` calls `${BIN_DIR}/train.py`. +Here's the complete help message. + +```text +usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] + [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR] + [--ngpu NGPU] [--verbose VERBOSE] + +Train a HiFiGAN model. + +optional arguments: + -h, --help show this help message and exit + --config CONFIG config file to overwrite default config. + --train-metadata TRAIN_METADATA + training data. + --dev-metadata DEV_METADATA + dev data. + --output-dir OUTPUT_DIR + output dir. + --ngpu NGPU if ngpu == 0, use cpu. + --verbose VERBOSE verbose. +``` + +1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. +2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. +4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. + +### Synthesizing +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +```bash +CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} +``` +```text +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] + +Synthesize with GANVocoder. 
+ +optional arguments: + -h, --help show this help message and exit + --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. + --checkpoint CHECKPOINT + snapshot to load. + --test-metadata TEST_METADATA + dev data. + --output-dir OUTPUT_DIR + output dir. + --ngpu NGPU if ngpu == 0, use cpu. + --verbose VERBOSE verbose. +``` + +1. `--config` config file. You should use the same config with which the model is trained. +2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory. +3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory. +4. `--output-dir` is the directory to save the synthesized audio files. +5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. + +## Fine-tuning diff --git a/examples/csmsc/voc5/conf/default.yaml b/examples/csmsc/voc5/conf/default.yaml new file mode 100644 index 0000000000000000000000000000000000000000..5192d3897f96673a5d6ef81ad93feba5be56af8c --- /dev/null +++ b/examples/csmsc/voc5/conf/default.yaml @@ -0,0 +1,167 @@ +# This is the configuration file for CSMSC dataset. +# This configuration is based on HiFiGAN V1, which is an official configuration. +# But I found that the optimizer setting does not work well with my implementation. +# So I changed optimizer settings as follows: +# - AdamW -> Adam +# - betas: [0.8, 0.99] -> betas: [0.5, 0.9] +# - Scheduler: ExponentialLR -> MultiStepLR +# To match the shift size difference, the upsample scales is also modified from the original 256 shift setting. + +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### +fs: 24000 # Sampling rate. +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. +n_mels: 80 # Number of mel basis. +fmin: 80 # Minimum freq in mel basis calculation. (Hz) +fmax: 7600 # Maximum frequency in mel basis calculation. (Hz) + +########################################################### +# GENERATOR NETWORK ARCHITECTURE SETTING # +########################################################### +generator_params: + in_channels: 80 # Number of input channels. + out_channels: 1 # Number of output channels. + channels: 512 # Number of initial channels. + kernel_size: 7 # Kernel size of initial and final conv layers. + upsample_scales: [5, 5, 4, 3] # Upsampling scales. + upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers. + resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks. + resblock_dilations: # Dilations for residual blocks. + - [1, 3, 5] + - [1, 3, 5] + - [1, 3, 5] + use_additional_convs: true # Whether to use additional conv layer in residual blocks. + bias: true # Whether to use bias parameter in conv. + nonlinear_activation: "leakyrelu" # Nonlinear activation type. + nonlinear_activation_params: # Nonlinear activation paramters. + negative_slope: 0.1 + use_weight_norm: true # Whether to apply weight normalization. 
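+    # Note: the product of upsample_scales (5 * 5 * 4 * 3 = 300) equals n_shift above,
+    # so each mel frame is upsampled to exactly one hop of output audio samples.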
+ + +########################################################### +# DISCRIMINATOR NETWORK ARCHITECTURE SETTING # +########################################################### +discriminator_params: + scales: 3 # Number of multi-scale discriminator. + scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator. + scale_downsample_pooling_params: + kernel_size: 4 # Pooling kernel size. + stride: 2 # Pooling stride. + padding: 2 # Padding size. + scale_discriminator_params: + in_channels: 1 # Number of input channels. + out_channels: 1 # Number of output channels. + kernel_sizes: [15, 41, 5, 3] # List of kernel sizes. + channels: 128 # Initial number of channels. + max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers. + max_groups: 16 # Maximum number of groups in downsampling conv layers. + bias: true + downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales. + nonlinear_activation: "leakyrelu" # Nonlinear activation. + nonlinear_activation_params: + negative_slope: 0.1 + follow_official_norm: true # Whether to follow the official norm setting. + periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator. + period_discriminator_params: + in_channels: 1 # Number of input channels. + out_channels: 1 # Number of output channels. + kernel_sizes: [5, 3] # List of kernel sizes. + channels: 32 # Initial number of channels. + downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales. + max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers. + bias: true # Whether to use bias parameter in conv layer." + nonlinear_activation: "leakyrelu" # Nonlinear activation. + nonlinear_activation_params: # Nonlinear activation paramters. + negative_slope: 0.1 + use_weight_norm: true # Whether to apply weight normalization. + use_spectral_norm: false # Whether to apply spectral normalization. + + +########################################################### +# STFT LOSS SETTING # +########################################################### +use_stft_loss: false # Whether to use multi-resolution STFT loss. +use_mel_loss: true # Whether to use Mel-spectrogram loss. +mel_loss_params: + fs: 24000 + fft_size: 2048 + hop_size: 300 + win_length: 1200 + window: "hann" + num_mels: 80 + fmin: 0 + fmax: 12000 + log_base: null +generator_adv_loss_params: + average_by_discriminators: false # Whether to average loss by #discriminators. +discriminator_adv_loss_params: + average_by_discriminators: false # Whether to average loss by #discriminators. +use_feat_match_loss: true +feat_match_loss_params: + average_by_discriminators: false # Whether to average loss by #discriminators. + average_by_layers: false # Whether to average loss by #layers in each discriminator. + include_final_outputs: false # Whether to include final outputs in feat match loss calculation. + +########################################################### +# ADVERSARIAL LOSS SETTING # +########################################################### +lambda_aux: 45.0 # Loss balancing coefficient for STFT loss. +lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss. +lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.. + +########################################################### +# DATA LOADER SETTING # +########################################################### +batch_size: 16 # Batch size. +batch_max_steps: 8400 # Length of each audio in batch. Make sure dividable by hop_size. 
+num_workers: 2 # Number of workers in Pytorch DataLoader. + +########################################################### +# OPTIMIZER & SCHEDULER SETTING # +########################################################### +generator_optimizer_params: + beta1: 0.5 + beta2: 0.9 + weight_decay: 0.0 # Generator's weight decay coefficient. +generator_scheduler_params: + learning_rate: 2.0e-4 # Generator's learning rate. + gamma: 0.5 # Generator's scheduler gamma. + milestones: # At each milestone, lr will be multiplied by gamma. + - 200000 + - 400000 + - 600000 + - 800000 +generator_grad_norm: -1 # Generator's gradient norm. +discriminator_optimizer_params: + beta1: 0.5 + beta2: 0.9 + weight_decay: 0.0 # Discriminator's weight decay coefficient. +discriminator_scheduler_params: + learning_rate: 2.0e-4 # Discriminator's learning rate. + gamma: 0.5 # Discriminator's scheduler gamma. + milestones: # At each milestone, lr will be multiplied by gamma. + - 200000 + - 400000 + - 600000 + - 800000 +discriminator_grad_norm: -1 # Discriminator's gradient norm. + +########################################################### +# INTERVAL SETTING # +########################################################### +generator_train_start_steps: 1 # Number of steps to start to train discriminator. +discriminator_train_start_steps: 0 # Number of steps to start to train discriminator. +train_max_steps: 2500000 # Number of training steps. +save_interval_steps: 5000 # Interval steps to save checkpoint. +eval_interval_steps: 1000 # Interval steps to evaluate the network. + +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random diff --git a/examples/csmsc/voc5/conf/finetune.yaml b/examples/csmsc/voc5/conf/finetune.yaml new file mode 100644 index 0000000000000000000000000000000000000000..9876e93d01c8be8d9d9499a35b56f60c49654af9 --- /dev/null +++ b/examples/csmsc/voc5/conf/finetune.yaml @@ -0,0 +1,168 @@ +# This is the configuration file for CSMSC dataset. +# This configuration is based on HiFiGAN V1, which is an official configuration. +# But I found that the optimizer setting does not work well with my implementation. +# So I changed optimizer settings as follows: +# - AdamW -> Adam +# - betas: [0.8, 0.99] -> betas: [0.5, 0.9] +# - Scheduler: ExponentialLR -> MultiStepLR +# To match the shift size difference, the upsample scales is also modified from the original 256 shift setting. + +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### +fs: 24000 # Sampling rate. +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. +n_mels: 80 # Number of mel basis. +fmin: 80 # Minimum freq in mel basis calculation. (Hz) +fmax: 7600 # Maximum frequency in mel basis calculation. (Hz) + +########################################################### +# GENERATOR NETWORK ARCHITECTURE SETTING # +########################################################### +generator_params: + in_channels: 80 # Number of input channels. + out_channels: 1 # Number of output channels. + channels: 512 # Number of initial channels. + kernel_size: 7 # Kernel size of initial and final conv layers. 
+ upsample_scales: [5, 5, 4, 3] # Upsampling scales. + upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers. + resblock_kernel_sizes: [3, 7, 11] # Kernel size for residual blocks. + resblock_dilations: # Dilations for residual blocks. + - [1, 3, 5] + - [1, 3, 5] + - [1, 3, 5] + use_additional_convs: true # Whether to use additional conv layer in residual blocks. + bias: true # Whether to use bias parameter in conv. + nonlinear_activation: "leakyrelu" # Nonlinear activation type. + nonlinear_activation_params: # Nonlinear activation paramters. + negative_slope: 0.1 + use_weight_norm: true # Whether to apply weight normalization. + + +########################################################### +# DISCRIMINATOR NETWORK ARCHITECTURE SETTING # +########################################################### +discriminator_params: + scales: 3 # Number of multi-scale discriminator. + scale_downsample_pooling: "AvgPool1D" # Pooling operation for scale discriminator. + scale_downsample_pooling_params: + kernel_size: 4 # Pooling kernel size. + stride: 2 # Pooling stride. + padding: 2 # Padding size. + scale_discriminator_params: + in_channels: 1 # Number of input channels. + out_channels: 1 # Number of output channels. + kernel_sizes: [15, 41, 5, 3] # List of kernel sizes. + channels: 128 # Initial number of channels. + max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers. + max_groups: 16 # Maximum number of groups in downsampling conv layers. + bias: true + downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales. + nonlinear_activation: "leakyrelu" # Nonlinear activation. + nonlinear_activation_params: + negative_slope: 0.1 + follow_official_norm: true # Whether to follow the official norm setting. + periods: [2, 3, 5, 7, 11] # List of period for multi-period discriminator. + period_discriminator_params: + in_channels: 1 # Number of input channels. + out_channels: 1 # Number of output channels. + kernel_sizes: [5, 3] # List of kernel sizes. + channels: 32 # Initial number of channels. + downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales. + max_downsample_channels: 1024 # Maximum number of channels in downsampling conv layers. + bias: true # Whether to use bias parameter in conv layer." + nonlinear_activation: "leakyrelu" # Nonlinear activation. + nonlinear_activation_params: # Nonlinear activation paramters. + negative_slope: 0.1 + use_weight_norm: true # Whether to apply weight normalization. + use_spectral_norm: false # Whether to apply spectral normalization. + + +########################################################### +# STFT LOSS SETTING # +########################################################### +use_stft_loss: false # Whether to use multi-resolution STFT loss. +use_mel_loss: true # Whether to use Mel-spectrogram loss. +mel_loss_params: + fs: 24000 + fft_size: 2048 + hop_size: 300 + win_length: 1200 + window: "hann" + num_mels: 80 + fmin: 0 + fmax: 12000 + log_base: null +generator_adv_loss_params: + average_by_discriminators: false # Whether to average loss by #discriminators. +discriminator_adv_loss_params: + average_by_discriminators: false # Whether to average loss by #discriminators. +use_feat_match_loss: true +feat_match_loss_params: + average_by_discriminators: false # Whether to average loss by #discriminators. + average_by_layers: false # Whether to average loss by #layers in each discriminator. + include_final_outputs: false # Whether to include final outputs in feat match loss calculation. 
+ +########################################################### +# ADVERSARIAL LOSS SETTING # +########################################################### +lambda_aux: 45.0 # Loss balancing coefficient for STFT loss. +lambda_adv: 1.0 # Loss balancing coefficient for adversarial loss. +lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.. + +########################################################### +# DATA LOADER SETTING # +########################################################### +batch_size: 16 # Batch size. +batch_max_steps: 8400 # Length of each audio in batch. Make sure dividable by hop_size. +num_workers: 2 # Number of workers in Pytorch DataLoader. + +########################################################### +# OPTIMIZER & SCHEDULER SETTING # +########################################################### +generator_optimizer_params: + beta1: 0.5 + beta2: 0.9 + weight_decay: 0.0 # Generator's weight decay coefficient. +generator_scheduler_params: + learning_rate: 2.0e-4 # Generator's learning rate. + gamma: 0.5 # Generator's scheduler gamma. + milestones: # At each milestone, lr will be multiplied by gamma. + - 200000 + - 400000 + - 600000 + - 800000 +generator_grad_norm: -1 # Generator's gradient norm. +discriminator_optimizer_params: + beta1: 0.5 + beta2: 0.9 + weight_decay: 0.0 # Discriminator's weight decay coefficient. +discriminator_scheduler_params: + learning_rate: 2.0e-4 # Discriminator's learning rate. + gamma: 0.5 # Discriminator's scheduler gamma. + milestones: # At each milestone, lr will be multiplied by gamma. + - 200000 + - 400000 + - 600000 + - 800000 +discriminator_grad_norm: -1 # Discriminator's gradient norm. + +########################################################### +# INTERVAL SETTING # +########################################################### +generator_train_start_steps: 1 # Number of steps to start to train discriminator. +discriminator_train_start_steps: 0 # Number of steps to start to train discriminator. +train_max_steps: 2500000 # Number of training steps. +save_interval_steps: 10000 # Interval steps to save checkpoint. +eval_interval_steps: 1000 # Interval steps to evaluate the network. +log_interval_steps: 100 # Interval steps to record the training log. 
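+# Note: compared with conf/default.yaml, this fine-tuning config saves checkpoints every
+# 10000 steps (instead of 5000) and additionally records the training log every 100 steps.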
+ +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random diff --git a/examples/csmsc/voc5/finetune.sh b/examples/csmsc/voc5/finetune.sh new file mode 100755 index 0000000000000000000000000000000000000000..4ab10e5b3775af433ae9c9f6a2e82966ca3361ba --- /dev/null +++ b/examples/csmsc/voc5/finetune.sh @@ -0,0 +1,62 @@ +#!/bin/bash + +source path.sh + +gpus=0 +stage=0 +stop_stage=100 + +source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${MAIN_ROOT}/paddlespeech/t2s/exps/fastspeech2/gen_gta_mel.py \ + --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \ + --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \ + --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \ + --dur-file=durations.txt \ + --output-dir=dump_finetune \ + --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 local/link_wav.py \ + --old-dump-dir=dump \ + --dump-dir=dump_finetune +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + cp dump/train/feats_stats.npy dump_finetune/train/ +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # normalize, dev and test should use train's stats + echo "Normalize ..." + + python3 ${BIN_DIR}/../normalize.py \ + --metadata=dump_finetune/train/raw/metadata.jsonl \ + --dumpdir=dump_finetune/train/norm \ + --stats=dump_finetune/train/feats_stats.npy + python3 ${BIN_DIR}/../normalize.py \ + --metadata=dump_finetune/dev/raw/metadata.jsonl \ + --dumpdir=dump_finetune/dev/norm \ + --stats=dump_finetune/train/feats_stats.npy + + python3 ${BIN_DIR}/../normalize.py \ + --metadata=dump_finetune/test/raw/metadata.jsonl \ + --dumpdir=dump_finetune/test/norm \ + --stats=dump_finetune/train/feats_stats.npy +fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + CUDA_VISIBLE_DEVICES=${gpus} \ + FLAGS_cudnn_exhaustive_search=true \ + FLAGS_conv_workspace_size_limit=4000 \ + python ${BIN_DIR}/train.py \ + --train-metadata=dump_finetune/train/norm/metadata.jsonl \ + --dev-metadata=dump_finetune/dev/norm/metadata.jsonl \ + --config=conf/finetune.yaml \ + --output-dir=exp/finetune \ + --ngpu=1 +fi \ No newline at end of file diff --git a/examples/csmsc/voc5/local/link_wav.py b/examples/csmsc/voc5/local/link_wav.py new file mode 100644 index 0000000000000000000000000000000000000000..c81e0d4b83320665b98720d09a940e9de6dc63cd --- /dev/null +++ b/examples/csmsc/voc5/local/link_wav.py @@ -0,0 +1,85 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
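+# This helper prepares the fine-tuning dump directory: it symlinks the original
+# `*_wave.npy` files from the old dump directory into the new one and rebuilds
+# `metadata.jsonl`, pairing each GTA-generated mel (`*_feats.npy`) with its
+# ground-truth waveform.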
+import argparse +import os +from operator import itemgetter +from pathlib import Path + +import jsonlines +import numpy as np + + +def main(): + # parse config and args + parser = argparse.ArgumentParser( + description="Preprocess audio and then extract features .") + + parser.add_argument( + "--old-dump-dir", + default=None, + type=str, + help="directory to dump feature files.") + parser.add_argument( + "--dump-dir", + type=str, + required=True, + help="directory to finetune dump feature files.") + args = parser.parse_args() + + old_dump_dir = Path(args.old_dump_dir).expanduser() + old_dump_dir = old_dump_dir.resolve() + dump_dir = Path(args.dump_dir).expanduser() + # use absolute path + dump_dir = dump_dir.resolve() + dump_dir.mkdir(parents=True, exist_ok=True) + + assert old_dump_dir.is_dir() + assert dump_dir.is_dir() + + for sub in ["train", "dev", "test"]: + # 把 old_dump_dir 里面的 *-wave.npy 软连接到 dump_dir 的对应位置 + output_dir = dump_dir / sub + output_dir.mkdir(parents=True, exist_ok=True) + results = [] + for name in os.listdir(output_dir / "raw"): + # 003918_feats.npy + utt_id = name.split("_")[0] + mel_path = output_dir / ("raw/" + name) + gen_mel = np.load(mel_path) + wave_name = utt_id + "_wave.npy" + wav = np.load(old_dump_dir / sub / ("raw/" + wave_name)) + os.symlink(old_dump_dir / sub / ("raw/" + wave_name), + output_dir / ("raw/" + wave_name)) + num_sample = wav.shape[0] + num_frames = gen_mel.shape[0] + wav_path = output_dir / ("raw/" + wave_name) + + record = { + "utt_id": utt_id, + "num_samples": num_sample, + "num_frames": num_frames, + "feats": str(mel_path), + "wave": str(wav_path), + } + results.append(record) + + results.sort(key=itemgetter("utt_id")) + + with jsonlines.open(output_dir / "raw/metadata.jsonl", 'w') as writer: + for item in results: + writer.write(item) + + +if __name__ == "__main__": + main() diff --git a/examples/csmsc/voc5/local/preprocess.sh b/examples/csmsc/voc5/local/preprocess.sh new file mode 100755 index 0000000000000000000000000000000000000000..61d6d62bef566d385c4d3d2407ce437ec6d8e9ad --- /dev/null +++ b/examples/csmsc/voc5/local/preprocess.sh @@ -0,0 +1,55 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./baker_alignment_tone \ + --output=durations.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/../preprocess.py \ + --rootdir=~/datasets/BZNSYP/ \ + --dataset=baker \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --cut-sil=True \ + --num-cpu=20 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="feats" +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # normalize, dev and test should use train's stats + echo "Normalize ..." 
+ + python3 ${BIN_DIR}/../normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --stats=dump/train/feats_stats.npy + python3 ${BIN_DIR}/../normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --stats=dump/train/feats_stats.npy + + python3 ${BIN_DIR}/../normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --stats=dump/train/feats_stats.npy +fi diff --git a/examples/csmsc/voc5/local/synthesize.sh b/examples/csmsc/voc5/local/synthesize.sh new file mode 100755 index 0000000000000000000000000000000000000000..6478961756f0b3659a83ebe6036de8223a87b7aa --- /dev/null +++ b/examples/csmsc/voc5/local/synthesize.sh @@ -0,0 +1,14 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +FLAGS_allocator_strategy=naive_best_fit \ +FLAGS_fraction_of_gpu_memory_to_use=0.01 \ +python3 ${BIN_DIR}/../synthesize.py \ + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=hifigan diff --git a/examples/csmsc/voc5/local/train.sh b/examples/csmsc/voc5/local/train.sh new file mode 100755 index 0000000000000000000000000000000000000000..9695631ef023795f6c54e5c11bbcff5b6a6b2998 --- /dev/null +++ b/examples/csmsc/voc5/local/train.sh @@ -0,0 +1,13 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +FLAGS_cudnn_exhaustive_search=true \ +FLAGS_conv_workspace_size_limit=4000 \ +python ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=1 diff --git a/examples/csmsc/voc5/path.sh b/examples/csmsc/voc5/path.sh new file mode 100755 index 0000000000000000000000000000000000000000..2ec84a9bef0620d6668e3a32faee11fbd0af85d3 --- /dev/null +++ b/examples/csmsc/voc5/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=hifigan +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL} \ No newline at end of file diff --git a/examples/csmsc/voc5/run.sh b/examples/csmsc/voc5/run.sh new file mode 100755 index 0000000000000000000000000000000000000000..3e7d7e2ab61882a7985c0de110803815071afd1a --- /dev/null +++ b/examples/csmsc/voc5/run.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=0 +stop_stage=100 + +conf_path=conf/default.yaml +train_output_path=exp/default +ckpt_name=snapshot_iter_50000.pdz + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this can not be mixed use with `$1`, `$2` ... 
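+# e.g. `./run.sh --stage 1 --stop-stage 2` skips preprocessing and only runs training and synthesis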
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # prepare data + ./local/preprocess.sh ${conf_path} || exit -1 +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `train_output_path/checkpoints/` dir + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # synthesize + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi diff --git a/examples/librispeech/asr1/RESULTS.md b/examples/librispeech/asr1/RESULTS.md index d5f5a9a46b310e0b88755f2225715eee1b90f299..10f0fe33d97d56483d37a656813c508ffd15c9b1 100644 --- a/examples/librispeech/asr1/RESULTS.md +++ b/examples/librispeech/asr1/RESULTS.md @@ -30,4 +30,4 @@ train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10 | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention | 6.382194232940674 | 0.049661 | | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.382194232940674 | 0.049566 | | transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.382194232940674 | 0.049585 | -| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.382194232940674 | 0.038135 | +| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.382194232940674 | 0.038135 | diff --git a/examples/ljspeech/tts0/README.md b/examples/ljspeech/tts0/README.md index d49d6d6b60103b74b165f493895e2f306e14caec..baaec818306b30f2a8a049a5ba2eeb2225b90fff 100644 --- a/examples/ljspeech/tts0/README.md +++ b/examples/ljspeech/tts0/README.md @@ -1,5 +1,5 @@ # Tacotron2 with LJSpeech -PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884). +PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from the text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884). ## Dataset We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). @@ -18,7 +18,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -40,7 +40,7 @@ optional arguments: -h, --help show this help message and exit --config FILE path of the config file to overwrite to default config with. - --data DATA_DIR path to the datatset. + --data DATA_DIR path to the dataset. --output OUTPUT_DIR path to save checkpoint and logs. --checkpoint_path CHECKPOINT_PATH path of the checkpoint to load @@ -50,9 +50,9 @@ optional arguments: ``` If you want to train on CPU, just set `--ngpu=0`. -If you want to train on multiple GPUs, just set `--ngpu` as num of GPU. 
+If you want to train on multiple GPUs, just set `--ngpu` as the num of GPU. By default, training will be resumed from the latest checkpoint in `--output`, if you want to start a new training, please use a new `${OUTPUTPATH}` with no checkpoint. -And if you want to resume from an other existing model, you should set `checkpoint_path` to be the checkpoint path you want to load. +And if you want to resume from another existing model, you should set `checkpoint_path` to be the checkpoint path you want to load. **Note: The checkpoint path cannot contain the file extension.** ### Synthesizing @@ -79,11 +79,11 @@ optional arguments: config, passing in KEY VALUE pairs -v, --verbose print msg ``` -**Ps.** You can use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels to wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example) +**Ps.** You can use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels to wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example) ## Pretrained Models -Pretrained Models can be downloaded from links below. We provide 2 models with different configurations. +Pretrained Models can be downloaded from the links below. We provide 2 models with different configurations. -1. This model use a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip) +1. This model uses a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip) -2. This model does not have a stop token predictor. It uses the attention peak position to decided whether all the contents have been uttered. Also guided attention loss is used to speed up training. This model is trained with `configs/alternative.yaml`.[tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3_alternative.zip) +2. This model does not have a stop token predictor. It uses the attention peak position to decide whether all the contents have been uttered. Also, guided attention loss is used to speed up training. This model is trained with `configs/alternative.yaml`.[tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3_alternative.zip) diff --git a/examples/ljspeech/tts1/README.md b/examples/ljspeech/tts1/README.md index c2d0c59e821c5ad9d3116f05370dcd44b0302961..5bb163e1d70aa9df20c2759f8068cb0cc7f24f70 100644 --- a/examples/ljspeech/tts1/README.md +++ b/examples/ljspeech/tts1/README.md @@ -18,7 +18,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -42,9 +42,9 @@ dump ├── raw └── speech_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. 
The raw folder contains speech feature of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/speech_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains the speech feature of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/speech_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, path of speech features, speaker and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, the path of speech features, speaker, and id of each utterance. ### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. @@ -75,7 +75,7 @@ optional arguments: ``` 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. @@ -85,7 +85,7 @@ Download Pretrained WaveFlow Model with residual channel equals 128 from [wavefl ```bash unzip waveflow_ljspeech_ckpt_0.3.zip ``` -WaveFlow checkpoint contains files listed below. +WaveFlow checkpoint contains files listed below. ```text waveflow_ljspeech_ckpt_0.3 ├── config.yaml # default config used to train waveflow diff --git a/examples/ljspeech/tts1/conf/default.yaml b/examples/ljspeech/tts1/conf/default.yaml index 350212da70162d76a5e2a2ef086a9a7549adbb07..6b495effc8d0f86aa3e872f056eb4cd259eb3327 100644 --- a/examples/ljspeech/tts1/conf/default.yaml +++ b/examples/ljspeech/tts1/conf/default.yaml @@ -1,8 +1,8 @@ fs : 22050 # Hz, sample rate -n_fft : 1024 # fft frame size -win_length : 1024 # window size -n_shift : 256 # hop size between ajacent frame +n_fft : 1024 # FFT size (samples). +win_length : 1024 # Window length (samples). 46.4ms +n_shift : 256 # Hop size (samples). 
11.6ms fmin : 0 # Hz, min frequency when converting to mel fmax : 8000 # Hz, max frequency when converting to mel n_mels : 80 # mel bands diff --git a/examples/ljspeech/tts1/local/synthesize.sh b/examples/ljspeech/tts1/local/synthesize.sh index 9fe837a44e62dec8e0c504d58bfd328e14cb5350..9d1c47b39f73fc7738018ec37220de23a102dd6c 100755 --- a/examples/ljspeech/tts1/local/synthesize.sh +++ b/examples/ljspeech/tts1/local/synthesize.sh @@ -7,11 +7,11 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/synthesize.py \ - --transformer-tts-config=${config_path} \ - --transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --transformer-tts-stat=dump/train/speech_stats.npy \ - --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \ - --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --phones-dict=dump/phone_id_map.txt + --transformer-tts-config=${config_path} \ + --transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --transformer-tts-stat=dump/train/speech_stats.npy \ + --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \ + --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --phones-dict=dump/phone_id_map.txt diff --git a/examples/ljspeech/tts1/local/synthesize_e2e.sh b/examples/ljspeech/tts1/local/synthesize_e2e.sh index 046fdb7087c9b29fda13b7f483cfb45112331c04..25a862f9007499d6e1703ec54831884fd86274d4 100755 --- a/examples/ljspeech/tts1/local/synthesize_e2e.sh +++ b/examples/ljspeech/tts1/local/synthesize_e2e.sh @@ -7,11 +7,11 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/synthesize_e2e.py \ - --transformer-tts-config=${config_path} \ - --transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --transformer-tts-stat=dump/train/speech_stats.npy \ - --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \ - --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \ - --text=${BIN_DIR}/../sentences_en.txt \ - --output-dir=${train_output_path}/test_e2e \ - --phones-dict=dump/phone_id_map.txt + --transformer-tts-config=${config_path} \ + --transformer-tts-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --transformer-tts-stat=dump/train/speech_stats.npy \ + --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \ + --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \ + --text=${BIN_DIR}/../sentences_en.txt \ + --output-dir=${train_output_path}/test_e2e \ + --phones-dict=dump/phone_id_map.txt diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index 93f14edc39304d61563f185fe67e810e392dc7e5..692c9746a09e66bebc3e1dbc0ed371cdc9de0e0c 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech ### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. 
-You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. @@ -22,7 +22,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -49,9 +49,9 @@ dump ├── raw └── speech_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance. ### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. @@ -85,7 +85,7 @@ optional arguments: ``` 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. @@ -102,97 +102,115 @@ pwg_ljspeech_ckpt_0.5 ├── pwg_snapshot_iter_400000.pdz # generator parameters of parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan ``` -`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. 
+`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` -```text -usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] - -Synthesize with fastspeech2 & parallel wavegan. +``text +usage: synthesize.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] + [--voice-cloning VOICE_CLONING] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--ngpu NGPU] + [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --speaker-dict SPEAKER_DICT - speaker id map file for multiple speaker model. - --test-metadata TEST_METADATA + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --voice-cloning VOICE_CLONING + whether training voice cloning model. + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --ngpu NGPU if ngpu == 0, use cpu. + --test_metadata TEST_METADATA test metadata. - --output-dir OUTPUT_DIR + --output_dir OUTPUT_DIR output dir. - --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. ``` -`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e_en.py`, which can synthesize waveform from text file. +`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file. 
```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize_e2e.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] - [--pwg-stat PWG_STAT] [--phones-dict PHONES_DICT] - [--text TEXT] [--output-dir OUTPUT_DIR] - [--inference-dir INFERENCE_DIR] [--ngpu NGPU] - [--verbose VERBOSE] - -Synthesize with fastspeech2 & parallel wavegan. +usage: synthesize_e2e.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] + [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--lang LANG] + [--inference_dir INFERENCE_DIR] [--ngpu NGPU] + [--text TEXT] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --text TEXT text to synthesize, a 'utt_id sentence' pair per line. - --output-dir OUTPUT_DIR - output dir. - --inference-dir INFERENCE_DIR + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --spk_id SPK_ID spk id for multi speaker acoustic model + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --lang LANG Choose model language. zh or en + --inference_dir INFERENCE_DIR dir to save inference models --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. + --text TEXT text to synthesize, a 'utt_id sentence' pair per line. + --output_dir OUTPUT_DIR + output dir. ``` - -1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat` and `--phones-dict` are arguments for fastspeech2, which correspond to the 4 files in the fastspeech2 pretrained model. -2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. -3. 
`--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. -4. `--text` is the text file, which contains sentences to synthesize. -5. `--output-dir` is the directory to save synthesized audio files. -6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. +1. `--am` is acoustic model type with the format {model_name}_{dataset} +2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model. +3. `--voc` is vocoder type with the format {model_name}_{dataset} +4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. +5. `--lang` is the model language, which can be `zh` or `en`. +6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. +7. `--text` is the text file, which contains sentences to synthesize. +8. `--output_dir` is the directory to save synthesized audio files. +9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) @@ -216,14 +234,18 @@ source path.sh FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize_e2e_en.py \ - --fastspeech2-config=fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ - --fastspeech2-checkpoint=fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ - --fastspeech2-stat=fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ - --pwg-config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ - --pwg-checkpoint=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_ljspeech \ + --am_config=fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ + --am_ckpt=fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ + --am_stat=fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ + --voc=pwgan_ljspeech\ + --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ + --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ + --lang=en \ --text=${BIN_DIR}/../sentences_en.txt \ - --output-dir=exp/default/test_e2e \ - --phones-dict=fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt + --output_dir=exp/default/test_e2e \ + --inference_dir=exp/default/inference \ + --phones_dict=fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt ``` diff --git a/examples/ljspeech/tts3/conf/default.yaml b/examples/ljspeech/tts3/conf/default.yaml index e96422a1902d653e184c5afe39def0f8106200a5..872dafcbe35aa10fa72c90f1eead4bbc242e3ebb 100644 --- a/examples/ljspeech/tts3/conf/default.yaml +++ b/examples/ljspeech/tts3/conf/default.yaml @@ -3,9 +3,9 @@ ########################################################### fs: 22050 # sr -n_fft: 1024 # FFT size. -n_shift: 256 # Hop size. -win_length: null # Window length. +n_fft: 1024 # FFT size (samples). +n_shift: 256 # Hop size (samples). 11.6ms +win_length: null # Window length (samples). # If set to null, it will be the same as fft_size. window: "hann" # Window function. 
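The millisecond figures added to the `n_shift` / `win_length` comments throughout these configs are just `samples / fs`. A quick sanity check (not part of the patch itself) that reproduces the quoted values for the 24 kHz CSMSC vocoder configs and the 22.05 kHz LJSpeech configs:
```python
# Reproduce the hop/window durations quoted in the updated config comments.
def to_ms(samples: int, fs: int) -> float:
    """Convert a length in samples to milliseconds for a given sampling rate."""
    return samples / fs * 1000.0

# CSMSC vocoder configs (fs = 24000)
print(to_ms(300, 24000))              # n_shift    -> 12.5 ms
print(to_ms(1200, 24000))             # win_length -> 50.0 ms

# LJSpeech configs (fs = 22050)
print(round(to_ms(256, 22050), 1))    # n_shift    -> 11.6 ms
print(round(to_ms(1024, 22050), 1))   # win_length -> 46.4 ms
```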
diff --git a/examples/ljspeech/tts3/local/synthesize.sh b/examples/ljspeech/tts3/local/synthesize.sh index 9b22abb3c8b65367b207f2b1b8b65c707f629471..f150d158f6832cecfd0eff5028d5dd716f3d52f8 100755 --- a/examples/ljspeech/tts3/local/synthesize.sh +++ b/examples/ljspeech/tts3/local/synthesize.sh @@ -6,13 +6,15 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ - --pwg-checkpoint=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --phones-dict=dump/phone_id_map.txt +python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_ljspeech \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_ljspeech \ + --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ + --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt diff --git a/examples/ljspeech/tts3/local/synthesize_e2e.sh b/examples/ljspeech/tts3/local/synthesize_e2e.sh index c723feefde6987bab65d474686afba4435527280..0b0cb5741938d0455ec6244d70c083a97c89be0d 100755 --- a/examples/ljspeech/tts3/local/synthesize_e2e.sh +++ b/examples/ljspeech/tts3/local/synthesize_e2e.sh @@ -6,13 +6,17 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize_e2e_en.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ - --pwg-checkpoint=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ - --pwg-stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ - --text=${BIN_DIR}/../sentences_en.txt \ - --output-dir=${train_output_path}/test_e2e \ - --phones-dict=dump/phone_id_map.txt +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_ljspeech \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_ljspeech \ + --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \ + --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \ + --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \ + --lang=en \ + --text=${BIN_DIR}/../sentences_en.txt \ + --output_dir=${train_output_path}/test_e2e \ + --inference_dir=${train_output_path}/inference \ + --phones_dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/ljspeech/voc0/README.md b/examples/ljspeech/voc0/README.md index 725eb617bfd0e88fb9b7628e255a8e217d409607..13a50efb54049f6a78c24cf90fc66f00ad2ef0df 100644 --- a/examples/ljspeech/voc0/README.md +++ b/examples/ljspeech/voc0/README.md @@ -17,7 +17,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. 
+You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -45,7 +45,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_out Synthesize waveform. 1. We assume the `--input` is a directory containing several mel spectrograms(log magnitude) in `.npy` format. -2. The output would be saved in `--output` directory, containing several `.wav` files, each with the same name as the mel spectrogram does. +2. The output would be saved in the `--output` directory, containing several `.wav` files, each with the same name as the mel spectrogram does. 3. `--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extention name `.pdparmas` is not included here. 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index 5c556124a12fd34190169d7bed8bf518940ab0b4..9dd0f5cc340ea445feeb7ab13f3cbb343a330f45 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -4,8 +4,8 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a ### Download and Extract Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). ### Get MFA Result and Extract -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. @@ -19,7 +19,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -43,9 +43,9 @@ dump └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. 
The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. ### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. @@ -91,7 +91,7 @@ benchmark: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ### Synthesizing @@ -100,15 +100,19 @@ benchmark: CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] -Synthesize with parallel wavegan. +Synthesize with GANVocoder. optional arguments: -h, --help show this help message and exit - --config CONFIG parallel wavegan config file. + --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. --checkpoint CHECKPOINT snapshot to load. --test-metadata TEST_METADATA diff --git a/examples/ljspeech/voc1/conf/default.yaml b/examples/ljspeech/voc1/conf/default.yaml index fb97ea8e5fdcd883fe2960c0851fa59eaa08ce20..bef2d68149027125be74f09c523bafd29e49916f 100644 --- a/examples/ljspeech/voc1/conf/default.yaml +++ b/examples/ljspeech/voc1/conf/default.yaml @@ -7,9 +7,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 22050 # Sampling rate. -n_fft: 1024 # FFT size. (in samples) -n_shift: 256 # Hop size. (in samples) -win_length: null # Window length. (in samples) +n_fft: 1024 # FFT size (samples). +n_shift: 256 # Hop size (samples). 11.6ms +win_length: null # Window length (samples). # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. @@ -49,9 +49,9 @@ discriminator_params: bias: true # Whether to use bias parameter in conv. use_weight_norm: true # Whether to use weight norm. # If set to true, it will be applied to all of the conv layers. - nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. + nonlinear_activation: "leakyrelu" # Nonlinear function after each conv. nonlinear_activation_params: # Nonlinear function parameters - negative_slope: 0.2 # Alpha in LeakyReLU. + negative_slope: 0.2 # Alpha in leakyrelu. 
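The `nonlinear_activation` change above ("LeakyReLU" → "leakyrelu") only renames the string key used to construct the discriminator's activation; a minimal sketch of how such a lowercase key could be resolved, assuming a plain lookup table (illustrative only, not necessarily the resolver the GANVocoder code actually uses):

```python
# Illustrative only: map a lowercase activation name from the vocoder YAML
# config to a paddle.nn layer class and instantiate it with its parameters.
import paddle.nn as nn

_ACTIVATIONS = {
    "relu": nn.ReLU,
    "leakyrelu": nn.LeakyReLU,
    "tanh": nn.Tanh,
}

def build_activation(name: str, **params) -> nn.Layer:
    """Build the activation named in the config, e.g. 'leakyrelu'."""
    return _ACTIVATIONS[name.lower()](**params)

# Matches the discriminator settings above.
act = build_activation("leakyrelu", negative_slope=0.2)
```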
########################################################### # STFT LOSS SETTING # diff --git a/examples/ljspeech/voc1/local/synthesize.sh b/examples/ljspeech/voc1/local/synthesize.sh index d85d1b1d970c613630e6288c5b33d88f7a475016..145557b3d35a3e936c5c394ba9ca99bcbe6b3910 100755 --- a/examples/ljspeech/voc1/local/synthesize.sh +++ b/examples/ljspeech/voc1/local/synthesize.sh @@ -7,8 +7,8 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --config=${config_path} \ - --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --generator-type=pwgan + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=pwgan diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index d7d632cdd305302a88c083639b1f24ce5bb15d64..1f2c9338e43ea25d05f4902bc2c3575d450c7f2d 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -2,14 +2,14 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443). ## Dataset -### Download and Extract the datasaet +### Download and Extract the dataset Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443). ### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): -1. `p315`, because no txt for it. +1. `p315`, because of no text for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. ## Get Started @@ -25,7 +25,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -52,9 +52,9 @@ dump ├── raw └── speech_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of which contains a `norm` and `raw` sub folder. The raw folder contains speech、pitch and energy features of each utterances, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. 
+The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance. ### Model Training ```bash @@ -88,7 +88,7 @@ optional arguments: ``` 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--phones-dict` is the path of the phone vocabulary file. ### Synthesizing @@ -105,99 +105,115 @@ pwg_vctk_ckpt_0.5 ├── pwg_snapshot_iter_1000000.pdz # generator parameters of parallel wavegan └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan ``` -`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] [--speaker-dict SPEAKER_DICT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] - -Synthesize with fastspeech2 & parallel wavegan. +usage: synthesize.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] + [--voice-cloning VOICE_CLONING] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--ngpu NGPU] + [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. 
- --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. - --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --speaker-dict SPEAKER_DICT - speaker id map file for multiple speaker model. - --test-metadata TEST_METADATA + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --voice-cloning VOICE_CLONING + whether training voice cloning model. + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --ngpu NGPU if ngpu == 0, use cpu. + --test_metadata TEST_METADATA test metadata. - --output-dir OUTPUT_DIR + --output_dir OUTPUT_DIR output dir. - --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. ``` -`./local/synthesize_e2e.sh` calls `${BIN_DIR}/multi_spk_synthesize_e2e_en.py`, which can synthesize waveform from text file. +`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: multi_spk_synthesize_e2e_en.py [-h] - [--fastspeech2-config FASTSPEECH2_CONFIG] - [--fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT] - [--fastspeech2-stat FASTSPEECH2_STAT] - [--pwg-config PWG_CONFIG] - [--pwg-checkpoint PWG_CHECKPOINT] - [--pwg-stat PWG_STAT] - [--phones-dict PHONES_DICT] - [--speaker-dict SPEAKER_DICT] - [--text TEXT] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] - -Synthesize with fastspeech2 & parallel wavegan. +usage: synthesize_e2e.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] + [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--lang LANG] + [--inference_dir INFERENCE_DIR] [--ngpu NGPU] + [--text TEXT] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder optional arguments: -h, --help show this help message and exit - --fastspeech2-config FASTSPEECH2_CONFIG - fastspeech2 config file. - --fastspeech2-checkpoint FASTSPEECH2_CHECKPOINT - fastspeech2 checkpoint to load. - --fastspeech2-stat FASTSPEECH2_STAT - mean and standard deviation used to normalize - spectrogram when training fastspeech2. - --pwg-config PWG_CONFIG - parallel wavegan config file. - --pwg-checkpoint PWG_CHECKPOINT - parallel wavegan generator parameters to load. 
- --pwg-stat PWG_STAT mean and standard deviation used to normalize - spectrogram when training parallel wavegan. - --phones-dict PHONES_DICT + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. Use deault config when it is + None. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT phone vocabulary file. - --speaker-dict SPEAKER_DICT + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT speaker id map file. + --spk_id SPK_ID spk id for multi speaker acoustic model + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. Use deault config when it is None. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --lang LANG Choose model language. zh or en + --inference_dir INFERENCE_DIR + dir to save inference models + --ngpu NGPU if ngpu == 0, use cpu. --text TEXT text to synthesize, a 'utt_id sentence' pair per line. - --output-dir OUTPUT_DIR + --output_dir OUTPUT_DIR output dir. - --ngpu NGPU if ngpu == 0, use cpu. - --verbose VERBOSE verbose. ``` - -1. `--fastspeech2-config`, `--fastspeech2-checkpoint`, `--fastspeech2-stat` and `--phones-dict` are arguments for fastspeech2, which correspond to the 4 files in the fastspeech2 pretrained model. -2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model. -3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. -4. `--text` is the text file, which contains sentences to synthesize. -5. `--output-dir` is the directory to save synthesized audio files. -6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. +1. `--am` is acoustic model type with the format {model_name}_{dataset} +2. `--am_config`, `--am_checkpoint`, `--am_stat`, `--phones_dict` `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model. +3. `--voc` is vocoder type with the format {model_name}_{dataset} +4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. +5. `--lang` is the model language, which can be `zh` or `en`. +6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. +7. `--text` is the text file, which contains sentences to synthesize. +8. `--output_dir` is the directory to save synthesized audio files. +9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model Pretrained FastSpeech2 model with no silence in the edge of audios. 
[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip) @@ -217,15 +233,19 @@ source path.sh FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/multi_spk_synthesize_e2e_en.py \ - --fastspeech2-config=fastspeech2_nosil_vctk_ckpt_0.5/default.yaml \ - --fastspeech2-checkpoint=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \ - --fastspeech2-stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \ - --pwg-config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ - --pwg-checkpoint=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_vctk \ + --am_config=fastspeech2_nosil_vctk_ckpt_0.5/default.yaml \ + --am_ckpt=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \ + --am_stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \ + --voc=pwgan_vctk \ + --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ + --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ + --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ + --lang=en \ --text=${BIN_DIR}/../sentences_en.txt \ - --output-dir=exp/default/test_e2e \ - --phones-dict=fastspeech2_nosil_vctk_ckpt_0.5/phone_id_map.txt \ - --speaker-dict=fastspeech2_nosil_vctk_ckpt_0.5/speaker_id_map.txt + --output_dir=exp/default/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=0 ``` diff --git a/examples/vctk/tts3/conf/default.yaml b/examples/vctk/tts3/conf/default.yaml index 4f945a31c595e2f1d06962e5b08c81a0978a2221..2738e7c224514ac948107511e4679e93e4f75721 100644 --- a/examples/vctk/tts3/conf/default.yaml +++ b/examples/vctk/tts3/conf/default.yaml @@ -3,9 +3,9 @@ ########################################################### fs: 24000 # sr -n_fft: 2048 # FFT size. -n_shift: 300 # Hop size. -win_length: 1200 # Window length. +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length.(in samples) 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. 
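The multi-speaker command above picks a voice with `--spk_id`, which indexes into the `speaker_id_map.txt` passed as `--speaker_dict`; a small sketch for listing the available ids first, assuming the usual one `speaker id` pair per line (the same split-per-line format the deleted helper scripts further down read):

```python
# List the speakers available to --spk_id by reading the speaker id map.
# Path taken from the VCTK example above; adjust if your dump lives elsewhere.
speaker_dict = "dump/speaker_id_map.txt"

with open(speaker_dict, "rt") as f:
    spk_id = [line.strip().split() for line in f if line.strip()]

print("spk_num:", len(spk_id))   # number of voices the model was trained on
for name, idx in spk_id[:5]:     # e.g. a VCTK speaker name and its integer id
    print(f"--spk_id={idx} -> speaker {name}")
```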
diff --git a/examples/vctk/tts3/local/synthesize.sh b/examples/vctk/tts3/local/synthesize.sh index 8165c8581b061e77009a43271c95f9cf15404346..a8aef034cf0e03b0449ba96a219847ededb47b73 100755 --- a/examples/vctk/tts3/local/synthesize.sh +++ b/examples/vctk/tts3/local/synthesize.sh @@ -6,14 +6,16 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/synthesize.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ - --pwg-checkpoint=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --phones-dict=dump/phone_id_map.txt \ - --speaker-dict=dump/speaker_id_map.txt +python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_vctk \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_vctk \ + --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ + --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ + --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt diff --git a/examples/vctk/tts3/local/synthesize_e2e.sh b/examples/vctk/tts3/local/synthesize_e2e.sh index e0b2a041f5213745f2ce8c86cd28de09d69e1de2..954e8cb9a77681929158ce29f86a73de2077bb39 100755 --- a/examples/vctk/tts3/local/synthesize_e2e.sh +++ b/examples/vctk/tts3/local/synthesize_e2e.sh @@ -6,14 +6,18 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ -python3 ${BIN_DIR}/multi_spk_synthesize_e2e_en.py \ - --fastspeech2-config=${config_path} \ - --fastspeech2-checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --fastspeech2-stat=dump/train/speech_stats.npy \ - --pwg-config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ - --pwg-checkpoint=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ - --pwg-stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ - --text=${BIN_DIR}/../sentences_en.txt \ - --output-dir=${train_output_path}/test_e2e \ - --phones-dict=dump/phone_id_map.txt \ - --speaker-dict=dump/speaker_id_map.txt +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_vctk \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_vctk \ + --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \ + --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \ + --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \ + --lang=en \ + --text=${BIN_DIR}/../sentences_en.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=0 diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index 6d7b325630055c43de3d66735f8561108e6894ac..78254d4e0412b8bff08600b4250e49813f359452 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -6,10 +6,10 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. 
Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`. ### Get MFA Result and Extract -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence in the edge of audio. +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): -1. `p315`, because no txt for it. +1. `p315`, because of no text for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. ## Get Started @@ -24,7 +24,7 @@ Run the command below to ```bash ./run.sh ``` -You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset. ```bash ./run.sh --stage 0 --stop-stage 0 ``` @@ -48,9 +48,9 @@ dump └── feats_stats.npy ``` -The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains log magnitude of mel spectrogram of each utterances, while the norm folder contains normalized spectrogram. The statistics used to normalize the spectrogram is computed from the training set, which is located in `dump/train/feats_stats.npy`. +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`. -Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance. ### Model Training ```bash @@ -96,7 +96,7 @@ benchmark: 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. -3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 
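Since each `metadata.jsonl` row only carries an utterance id and paths to feature files, it is easy to inspect the test split that `--test-metadata` consumes before synthesizing; a minimal sketch using the same `jsonlines`/`numpy` calls as the synthesize utilities elsewhere in this patch (the exact field names depend on your dump, so they are probed rather than assumed):

```python
# Peek at the vocoder's normalized test metadata before running ./local/synthesize.sh.
import jsonlines
import numpy as np

with jsonlines.open("dump/test/norm/metadata.jsonl", "r") as reader:
    records = list(reader)

print("num utterances:", len(records))
first = records[0]
print("fields:", sorted(first.keys()))  # an utterance id plus feature path(s)

# If the normalized mel path is stored under a key such as "feats"
# (hypothetical key name; check the printed fields), load it directly:
feats_path = first.get("feats")
if feats_path is not None:
    mel = np.load(feats_path)
    print("mel shape:", mel.shape)      # (num_frames, n_mels); n_mels is 80 in this config
```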
### Synthesizing @@ -105,15 +105,19 @@ benchmark: CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` ```text -usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT] - [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR] - [--ngpu NGPU] [--verbose VERBOSE] +usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG] + [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA] + [--output-dir OUTPUT_DIR] [--ngpu NGPU] + [--verbose VERBOSE] -Synthesize with parallel wavegan. +Synthesize with GANVocoder. optional arguments: -h, --help show this help message and exit - --config CONFIG parallel wavegan config file. + --generator-type GENERATOR_TYPE + type of GANVocoder, should in {pwgan, mb_melgan, + style_melgan, } now + --config CONFIG GANVocoder config file. --checkpoint CHECKPOINT snapshot to load. --test-metadata TEST_METADATA diff --git a/examples/vctk/voc1/conf/default.yaml b/examples/vctk/voc1/conf/default.yaml index eb6d350d41af6af148e64551ac53dad30c77c0d6..d95eaad9dfdd6453be5b3bd24da85cf75ba5e59d 100644 --- a/examples/vctk/voc1/conf/default.yaml +++ b/examples/vctk/voc1/conf/default.yaml @@ -7,9 +7,9 @@ # FEATURE EXTRACTION SETTING # ########################################################### fs: 24000 # Sampling rate. -n_fft: 2048 # FFT size. (in samples) -n_shift: 300 # Hop size. (in samples) -win_length: 1200 # Window length. (in samples) +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms # If set to null, it will be the same as fft_size. window: "hann" # Window function. n_mels: 80 # Number of mel basis. @@ -49,9 +49,9 @@ discriminator_params: bias: true # Whether to use bias parameter in conv. use_weight_norm: true # Whether to use weight norm. # If set to true, it will be applied to all of the conv layers. - nonlinear_activation: "LeakyReLU" # Nonlinear function after each conv. + nonlinear_activation: "leakyrelu" # Nonlinear function after each conv. nonlinear_activation_params: # Nonlinear function parameters - negative_slope: 0.2 # Alpha in LeakyReLU. + negative_slope: 0.2 # Alpha in leakyrelu. ########################################################### # STFT LOSS SETTING # diff --git a/examples/vctk/voc1/local/synthesize.sh b/examples/vctk/voc1/local/synthesize.sh index d85d1b1d970c613630e6288c5b33d88f7a475016..145557b3d35a3e936c5c394ba9ca99bcbe6b3910 100755 --- a/examples/vctk/voc1/local/synthesize.sh +++ b/examples/vctk/voc1/local/synthesize.sh @@ -7,8 +7,8 @@ ckpt_name=$3 FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --config=${config_path} \ - --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ - --test-metadata=dump/test/norm/metadata.jsonl \ - --output-dir=${train_output_path}/test \ - --generator-type=pwgan + --config=${config_path} \ + --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \ + --test-metadata=dump/test/norm/metadata.jsonl \ + --output-dir=${train_output_path}/test \ + --generator-type=pwgan diff --git a/paddlespeech/t2s/exps/fastspeech2/inference.py b/paddlespeech/t2s/exps/fastspeech2/inference.py deleted file mode 100644 index 1d6ea667a28a8a8bb751f339ab25bd7e6edcff1b..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/inference.py +++ /dev/null @@ -1,135 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import argparse -import os -from pathlib import Path - -import soundfile as sf -from paddle import inference - -from paddlespeech.t2s.frontend.zh_frontend import Frontend - - -def main(): - parser = argparse.ArgumentParser( - description="Paddle Infernce with speedyspeech & parallel wavegan.") - parser.add_argument( - "--inference-dir", type=str, help="dir to save inference models") - parser.add_argument( - "--text", - type=str, - help="text to synthesize, a 'utt_id sentence' pair per line") - parser.add_argument("--output-dir", type=str, help="output dir") - parser.add_argument( - "--enable-auto-log", action="store_true", help="use auto log") - parser.add_argument( - "--phones-dict", - type=str, - default="phones.txt", - help="phone vocabulary file.") - - args, _ = parser.parse_known_args() - - frontend = Frontend(phone_vocab_path=args.phones_dict) - print("frontend done!") - - fastspeech2_config = inference.Config( - str(Path(args.inference_dir) / "fastspeech2.pdmodel"), - str(Path(args.inference_dir) / "fastspeech2.pdiparams")) - fastspeech2_config.enable_use_gpu(50, 0) - # This line must be commented, if not, it will OOM - # fastspeech2_config.enable_memory_optim() - fastspeech2_predictor = inference.create_predictor(fastspeech2_config) - - pwg_config = inference.Config( - str(Path(args.inference_dir) / "pwg.pdmodel"), - str(Path(args.inference_dir) / "pwg.pdiparams")) - pwg_config.enable_use_gpu(100, 0) - pwg_config.enable_memory_optim() - pwg_predictor = inference.create_predictor(pwg_config) - - if args.enable_auto_log: - import auto_log - os.makedirs("output", exist_ok=True) - pid = os.getpid() - logger = auto_log.AutoLogger( - model_name="fastspeech2", - model_precision='float32', - batch_size=1, - data_shape="dynamic", - save_path="./output/auto_log.log", - inference_config=fastspeech2_config, - pids=pid, - process_name=None, - gpu_ids=0, - time_keys=['preprocess_time', 'inference_time', 'postprocess_time'], - warmup=0) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - sentences = [] - - with open(args.text, 'rt') as f: - for line in f: - items = line.strip().split() - utt_id = items[0] - sentence = "".join(items[1:]) - sentences.append((utt_id, sentence)) - - for utt_id, sentence in sentences: - if args.enable_auto_log: - logger.times.start() - input_ids = frontend.get_input_ids(sentence, merge_sentences=True) - phone_ids = input_ids["phone_ids"] - phones = phone_ids[0].numpy() - - if args.enable_auto_log: - logger.times.stamp() - - input_names = fastspeech2_predictor.get_input_names() - phones_handle = fastspeech2_predictor.get_input_handle(input_names[0]) - - phones_handle.reshape(phones.shape) - phones_handle.copy_from_cpu(phones) - - fastspeech2_predictor.run() - output_names = fastspeech2_predictor.get_output_names() - output_handle = fastspeech2_predictor.get_output_handle(output_names[0]) - output_data = output_handle.copy_to_cpu() - - input_names = pwg_predictor.get_input_names() - 
mel_handle = pwg_predictor.get_input_handle(input_names[0]) - mel_handle.reshape(output_data.shape) - mel_handle.copy_from_cpu(output_data) - - pwg_predictor.run() - output_names = pwg_predictor.get_output_names() - output_handle = pwg_predictor.get_output_handle(output_names[0]) - wav = output_data = output_handle.copy_to_cpu() - - if args.enable_auto_log: - logger.times.stamp() - - sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000) - - if args.enable_auto_log: - logger.times.end(stamp=True) - print(f"{utt_id} done!") - - if args.enable_auto_log: - logger.report() - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/fastspeech2/multi_spk_synthesize_e2e.py b/paddlespeech/t2s/exps/fastspeech2/multi_spk_synthesize_e2e.py deleted file mode 100644 index 9dc3ab4b655badb41bcaf16b738bf9a1f3fa1faa..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/multi_spk_synthesize_e2e.py +++ /dev/null @@ -1,178 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import argparse -import logging -from pathlib import Path - -import numpy as np -import paddle -import soundfile as sf -import yaml -from yacs.config import CfgNode - -from paddlespeech.t2s.frontend.zh_frontend import Frontend -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference -from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator -from paddlespeech.t2s.models.parallel_wavegan import PWGInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, fastspeech2_config, pwg_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - sentences = [] - with open(args.text, 'rt') as f: - for line in f: - items = line.strip().split() - utt_id = items[0] - sentence = "".join(items[1:]) - sentences.append((utt_id, sentence)) - - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - print("vocab_size:", vocab_size) - with open(args.speaker_dict, 'rt') as f: - spk_id = [line.strip().split() for line in f.readlines()] - spk_num = len(spk_id) - print("spk_num:", spk_num) - - odim = fastspeech2_config.n_mels - model = FastSpeech2( - idim=vocab_size, - odim=odim, - spk_num=spk_num, - **fastspeech2_config["model"]) - - model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) - model.eval() - - vocoder = PWGGenerator(**pwg_config["generator_params"]) - vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - frontend = Frontend(phone_vocab_path=args.phones_dict) - print("frontend done!") - - stat = np.load(args.fastspeech2_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - fastspeech2_normalizer = ZScore(mu, std) - - stat = 
np.load(args.pwg_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - - fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - pwg_inference = PWGInference(pwg_normalizer, vocoder) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - # only test the number 0 speaker - spk_ids = list(range(20)) - for spk_id in spk_ids: - for utt_id, sentence in sentences[:2]: - input_ids = frontend.get_input_ids(sentence, merge_sentences=True) - phone_ids = input_ids["phone_ids"] - flags = 0 - for part_phone_ids in phone_ids: - with paddle.no_grad(): - mel = fastspeech2_inference( - part_phone_ids, spk_id=paddle.to_tensor(spk_id)) - temp_wav = pwg_inference(mel) - if flags == 0: - wav = temp_wav - flags = 1 - else: - wav = paddle.concat([wav, temp_wav]) - sf.write( - str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")), - wav.numpy(), - samplerate=fastspeech2_config.fs) - print(f"{spk_id}_{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with fastspeech2 & parallel wavegan.") - parser.add_argument( - "--fastspeech2-config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--fastspeech2-checkpoint", - type=str, - help="fastspeech2 checkpoint to load.") - parser.add_argument( - "--fastspeech2-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training fastspeech2." - ) - parser.add_argument( - "--pwg-config", type=str, help="parallel wavegan config file.") - parser.add_argument( - "--pwg-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--pwg-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training parallel wavegan." - ) - parser.add_argument( - "--phones-dict", type=str, default=None, help="phone vocabulary file.") - parser.add_argument( - "--speaker-dict", type=str, default=None, help="speaker id map file.") - parser.add_argument( - "--text", - type=str, - help="text to synthesize, a 'utt_id sentence' pair per line.") - parser.add_argument("--output-dir", type=str, help="output dir.") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose.") - - args = parser.parse_args() - - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.fastspeech2_config) as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) - with open(args.pwg_config) as f: - pwg_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(fastspeech2_config) - print(pwg_config) - - evaluate(args, fastspeech2_config, pwg_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/fastspeech2/multi_spk_synthesize_e2e_en.py b/paddlespeech/t2s/exps/fastspeech2/multi_spk_synthesize_e2e_en.py deleted file mode 100644 index 6a326e8a4026eacc49cbc1f59641d35cfeedc957..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/multi_spk_synthesize_e2e_en.py +++ /dev/null @@ -1,175 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import argparse -import logging -from pathlib import Path - -import numpy as np -import paddle -import soundfile as sf -import yaml -from yacs.config import CfgNode - -from paddlespeech.t2s.frontend import English -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference -from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator -from paddlespeech.t2s.models.parallel_wavegan import PWGInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, fastspeech2_config, pwg_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - sentences = [] - with open(args.text, 'rt') as f: - for line in f: - line_list = line.strip().split() - utt_id = line_list[0] - sentence = " ".join(line_list[1:]) - sentences.append((utt_id, sentence)) - - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - phone_id_map = {} - for phn, id in phn_id: - phone_id_map[phn] = int(id) - print("vocab_size:", vocab_size) - with open(args.speaker_dict, 'rt') as f: - spk_id = [line.strip().split() for line in f.readlines()] - spk_num = len(spk_id) - print("spk_num:", spk_num) - - odim = fastspeech2_config.n_mels - model = FastSpeech2( - idim=vocab_size, - odim=odim, - spk_num=spk_num, - **fastspeech2_config["model"]) - - model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) - model.eval() - - vocoder = PWGGenerator(**pwg_config["generator_params"]) - vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - frontend = English(phone_vocab_path=args.phones_dict) - print("frontend done!") - - stat = np.load(args.fastspeech2_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - fastspeech2_normalizer = ZScore(mu, std) - - stat = np.load(args.pwg_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - - fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - pwg_inference = PWGInference(pwg_normalizer, vocoder) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - # only test the number 0 speaker - spk_id = 0 - for utt_id, sentence in sentences: - input_ids = frontend.get_input_ids(sentence) - phone_ids = input_ids["phone_ids"] - - with paddle.no_grad(): - mel = fastspeech2_inference( - phone_ids, spk_id=paddle.to_tensor(spk_id)) - wav = pwg_inference(mel) - - sf.write( - str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")), - wav.numpy(), - samplerate=fastspeech2_config.fs) - print(f"{spk_id}_{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with fastspeech2 & 
parallel wavegan.") - parser.add_argument( - "--fastspeech2-config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--fastspeech2-checkpoint", - type=str, - help="fastspeech2 checkpoint to load.") - parser.add_argument( - "--fastspeech2-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training fastspeech2." - ) - parser.add_argument( - "--pwg-config", type=str, help="parallel wavegan config file.") - parser.add_argument( - "--pwg-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--pwg-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training parallel wavegan." - ) - parser.add_argument( - "--phones-dict", type=str, default=None, help="phone vocabulary file.") - parser.add_argument( - "--speaker-dict", type=str, default=None, help="speaker id map file.") - parser.add_argument( - "--text", - type=str, - help="text to synthesize, a 'utt_id sentence' pair per line.") - parser.add_argument("--output-dir", type=str, help="output dir.") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose.") - - args = parser.parse_args() - - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.fastspeech2_config) as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) - with open(args.pwg_config) as f: - pwg_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(fastspeech2_config) - print(pwg_config) - - evaluate(args, fastspeech2_config, pwg_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/fastspeech2/synthesize.py b/paddlespeech/t2s/exps/fastspeech2/synthesize.py deleted file mode 100644 index 249845e4dba5b96747a881eeeb2a2299e020969f..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/synthesize.py +++ /dev/null @@ -1,189 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import argparse -import logging -from pathlib import Path - -import jsonlines -import numpy as np -import paddle -import soundfile as sf -import yaml -from yacs.config import CfgNode - -from paddlespeech.t2s.datasets.data_table import DataTable -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference -from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator -from paddlespeech.t2s.models.parallel_wavegan import PWGInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, fastspeech2_config, pwg_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - with jsonlines.open(args.test_metadata, 'r') as reader: - test_metadata = list(reader) - - fields = ["utt_id", "text"] - - spk_num = None - if args.speaker_dict is not None: - print("multiple speaker fastspeech2!") - with open(args.speaker_dict, 'rt') as f: - spk_id = [line.strip().split() for line in f.readlines()] - spk_num = len(spk_id) - fields += ["spk_id"] - elif args.voice_cloning: - print("voice cloning!") - fields += ["spk_emb"] - else: - print("single speaker fastspeech2!") - print("spk_num:", spk_num) - - test_dataset = DataTable(data=test_metadata, fields=fields) - - odim = fastspeech2_config.n_mels - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - print("vocab_size:", vocab_size) - - model = FastSpeech2( - idim=vocab_size, - odim=odim, - spk_num=spk_num, - **fastspeech2_config["model"]) - - model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) - model.eval() - - vocoder = PWGGenerator(**pwg_config["generator_params"]) - vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - stat = np.load(args.fastspeech2_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - fastspeech2_normalizer = ZScore(mu, std) - - stat = np.load(args.pwg_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - - fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - pwg_inference = PWGInference(pwg_normalizer, vocoder) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - - for datum in test_dataset: - utt_id = datum["utt_id"] - text = paddle.to_tensor(datum["text"]) - spk_emb = None - spk_id = None - if args.voice_cloning and "spk_emb" in datum: - spk_emb = paddle.to_tensor(np.load(datum["spk_emb"])) - elif "spk_id" in datum: - spk_id = paddle.to_tensor(datum["spk_id"]) - with paddle.no_grad(): - wav = pwg_inference( - fastspeech2_inference(text, spk_id=spk_id, spk_emb=spk_emb)) - sf.write( - str(output_dir / (utt_id + ".wav")), - wav.numpy(), - samplerate=fastspeech2_config.fs) - print(f"{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with fastspeech2 & parallel wavegan.") - parser.add_argument( - "--fastspeech2-config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--fastspeech2-checkpoint", - type=str, - help="fastspeech2 checkpoint to load.") - parser.add_argument( - "--fastspeech2-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training fastspeech2." 
- ) - parser.add_argument( - "--pwg-config", type=str, help="parallel wavegan config file.") - parser.add_argument( - "--pwg-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--pwg-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training parallel wavegan." - ) - parser.add_argument( - "--phones-dict", type=str, default=None, help="phone vocabulary file.") - parser.add_argument( - "--speaker-dict", - type=str, - default=None, - help="speaker id map file for multiple speaker model.") - - def str2bool(str): - return True if str.lower() == 'true' else False - - parser.add_argument( - "--voice-cloning", - type=str2bool, - default=False, - help="whether training voice cloning model.") - parser.add_argument("--test-metadata", type=str, help="test metadata.") - parser.add_argument("--output-dir", type=str, help="output dir.") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose.") - - args = parser.parse_args() - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.fastspeech2_config) as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) - with open(args.pwg_config) as f: - pwg_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(fastspeech2_config) - print(pwg_config) - - evaluate(args, fastspeech2_config, pwg_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e.py b/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e.py deleted file mode 100644 index 47c8a5e7af57b0547a1baffe05d8de1d93782160..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e.py +++ /dev/null @@ -1,187 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import argparse -import logging -import os -from pathlib import Path - -import numpy as np -import paddle -import soundfile as sf -import yaml -from paddle import jit -from paddle.static import InputSpec -from yacs.config import CfgNode - -from paddlespeech.t2s.frontend.zh_frontend import Frontend -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference -from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator -from paddlespeech.t2s.models.parallel_wavegan import PWGInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, fastspeech2_config, pwg_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - sentences = [] - with open(args.text, 'rt') as f: - for line in f: - items = line.strip().split() - utt_id = items[0] - sentence = "".join(items[1:]) - sentences.append((utt_id, sentence)) - - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - print("vocab_size:", vocab_size) - odim = fastspeech2_config.n_mels - model = FastSpeech2( - idim=vocab_size, odim=odim, **fastspeech2_config["model"]) - - model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) - model.eval() - - vocoder = PWGGenerator(**pwg_config["generator_params"]) - vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - frontend = Frontend(phone_vocab_path=args.phones_dict) - print("frontend done!") - - stat = np.load(args.fastspeech2_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - fastspeech2_normalizer = ZScore(mu, std) - - stat = np.load(args.pwg_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - - fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - fastspeech2_inference.eval() - fastspeech2_inference = jit.to_static( - fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)]) - paddle.jit.save(fastspeech2_inference, - os.path.join(args.inference_dir, "fastspeech2")) - fastspeech2_inference = paddle.jit.load( - os.path.join(args.inference_dir, "fastspeech2")) - pwg_inference = PWGInference(pwg_normalizer, vocoder) - pwg_inference.eval() - pwg_inference = jit.to_static( - pwg_inference, input_spec=[ - InputSpec([-1, 80], dtype=paddle.float32), - ]) - paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg")) - pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg")) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - - for utt_id, sentence in sentences: - input_ids = frontend.get_input_ids(sentence, merge_sentences=True) - phone_ids = input_ids["phone_ids"] - flags = 0 - for part_phone_ids in phone_ids: - with paddle.no_grad(): - mel = fastspeech2_inference(part_phone_ids) - temp_wav = pwg_inference(mel) - if flags == 0: - wav = temp_wav - flags = 1 - else: - wav = paddle.concat([wav, temp_wav]) - sf.write( - str(output_dir / (utt_id + ".wav")), - wav.numpy(), - samplerate=fastspeech2_config.fs) - print(f"{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with fastspeech2 & parallel wavegan.") - parser.add_argument( - "--fastspeech2-config", 
type=str, help="fastspeech2 config file.") - parser.add_argument( - "--fastspeech2-checkpoint", - type=str, - help="fastspeech2 checkpoint to load.") - parser.add_argument( - "--fastspeech2-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training fastspeech2." - ) - parser.add_argument( - "--pwg-config", type=str, help="parallel wavegan config file.") - parser.add_argument( - "--pwg-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--pwg-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training parallel wavegan." - ) - parser.add_argument( - "--phones-dict", - type=str, - default="phone_id_map.txt", - help="phone vocabulary file.") - parser.add_argument( - "--text", - type=str, - help="text to synthesize, a 'utt_id sentence' pair per line.") - parser.add_argument("--output-dir", type=str, help="output dir.") - parser.add_argument( - "--inference-dir", type=str, help="dir to save inference models") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose.") - - args = parser.parse_args() - - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.fastspeech2_config) as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) - with open(args.pwg_config) as f: - pwg_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(fastspeech2_config) - print(pwg_config) - - evaluate(args, fastspeech2_config, pwg_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e_en.py b/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e_en.py deleted file mode 100644 index 1b980afb152774bf157c10a5cc5435ca71d62398..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e_en.py +++ /dev/null @@ -1,166 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import argparse -import logging -from pathlib import Path - -import numpy as np -import paddle -import soundfile as sf -import yaml -from yacs.config import CfgNode - -from paddlespeech.t2s.frontend import English -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference -from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator -from paddlespeech.t2s.models.parallel_wavegan import PWGInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, fastspeech2_config, pwg_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - sentences = [] - with open(args.text, 'rt') as f: - for line in f: - line_list = line.strip().split() - utt_id = line_list[0] - sentence = " ".join(line_list[1:]) - sentences.append((utt_id, sentence)) - - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - phone_id_map = {} - for phn, id in phn_id: - phone_id_map[phn] = int(id) - print("vocab_size:", vocab_size) - odim = fastspeech2_config.n_mels - model = FastSpeech2( - idim=vocab_size, odim=odim, **fastspeech2_config["model"]) - - model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) - model.eval() - - vocoder = PWGGenerator(**pwg_config["generator_params"]) - vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - frontend = English(phone_vocab_path=args.phones_dict) - print("frontend done!") - - stat = np.load(args.fastspeech2_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - fastspeech2_normalizer = ZScore(mu, std) - - stat = np.load(args.pwg_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - - fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - pwg_inference = PWGInference(pwg_normalizer, vocoder) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - - for utt_id, sentence in sentences: - input_ids = frontend.get_input_ids(sentence) - phone_ids = input_ids["phone_ids"] - - with paddle.no_grad(): - mel = fastspeech2_inference(phone_ids) - wav = pwg_inference(mel) - - sf.write( - str(output_dir / (utt_id + ".wav")), - wav.numpy(), - samplerate=fastspeech2_config.fs) - print(f"{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with fastspeech2 & parallel wavegan.") - parser.add_argument( - "--fastspeech2-config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--fastspeech2-checkpoint", - type=str, - help="fastspeech2 checkpoint to load.") - parser.add_argument( - "--fastspeech2-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training fastspeech2." - ) - parser.add_argument( - "--pwg-config", type=str, help="parallel wavegan config file.") - parser.add_argument( - "--pwg-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--pwg-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training parallel wavegan." 
- ) - parser.add_argument( - "--phones-dict", - type=str, - default="phone_id_map.txt", - help="phone vocabulary file.") - parser.add_argument( - "--text", - type=str, - help="text to synthesize, a 'utt_id sentence' pair per line.") - parser.add_argument("--output-dir", type=str, help="output dir.") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose.") - - args = parser.parse_args() - - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.fastspeech2_config) as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) - with open(args.pwg_config) as f: - pwg_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(fastspeech2_config) - print(pwg_config) - - evaluate(args, fastspeech2_config, pwg_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e_melgan.py b/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e_melgan.py deleted file mode 100644 index 4d5d1ac413eb9ee512a5bfa4d0d9350744c3f15e..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/fastspeech2/synthesize_e2e_melgan.py +++ /dev/null @@ -1,192 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import argparse -import logging -import os -from pathlib import Path - -import numpy as np -import paddle -import soundfile as sf -import yaml -from paddle import jit -from paddle.static import InputSpec -from yacs.config import CfgNode - -from paddlespeech.t2s.frontend.zh_frontend import Frontend -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2 -from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference -from paddlespeech.t2s.models.melgan import MelGANGenerator -from paddlespeech.t2s.models.melgan import MelGANInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, fastspeech2_config, melgan_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - sentences = [] - with open(args.text, 'rt') as f: - for line in f: - items = line.strip().split() - utt_id = items[0] - sentence = "".join(items[1:]) - sentences.append((utt_id, sentence)) - - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - print("vocab_size:", vocab_size) - odim = fastspeech2_config.n_mels - model = FastSpeech2( - idim=vocab_size, odim=odim, **fastspeech2_config["model"]) - - model.set_state_dict( - paddle.load(args.fastspeech2_checkpoint)["main_params"]) - model.eval() - - vocoder = MelGANGenerator(**melgan_config["generator_params"]) - vocoder.set_state_dict( - paddle.load(args.melgan_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - frontend = Frontend(phone_vocab_path=args.phones_dict) - print("frontend done!") - - stat = np.load(args.fastspeech2_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - fastspeech2_normalizer = ZScore(mu, std) - - stat = np.load(args.melgan_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - - fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model) - fastspeech2_inference.eval() - fastspeech2_inference = jit.to_static( - fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)]) - paddle.jit.save(fastspeech2_inference, - os.path.join(args.inference_dir, "fastspeech2")) - fastspeech2_inference = paddle.jit.load( - os.path.join(args.inference_dir, "fastspeech2")) - - mb_melgan_inference = MelGANInference(pwg_normalizer, vocoder) - mb_melgan_inference.eval() - mb_melgan_inference = jit.to_static( - mb_melgan_inference, - input_spec=[ - InputSpec([-1, 80], dtype=paddle.float32), - ]) - paddle.jit.save(mb_melgan_inference, - os.path.join(args.inference_dir, "mb_melgan")) - mb_melgan_inference = paddle.jit.load( - os.path.join(args.inference_dir, "mb_melgan")) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - - for utt_id, sentence in sentences: - input_ids = frontend.get_input_ids(sentence, merge_sentences=True) - phone_ids = input_ids["phone_ids"] - flags = 0 - for part_phone_ids in phone_ids: - with paddle.no_grad(): - mel = fastspeech2_inference(part_phone_ids) - temp_wav = mb_melgan_inference(mel) - if flags == 0: - wav = temp_wav - flags = 1 - else: - wav = paddle.concat([wav, temp_wav]) - sf.write( - str(output_dir / (utt_id + ".wav")), - wav.numpy(), - samplerate=fastspeech2_config.fs) - print(f"{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with fastspeech2 & 
parallel wavegan.") - parser.add_argument( - "--fastspeech2-config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--fastspeech2-checkpoint", - type=str, - help="fastspeech2 checkpoint to load.") - parser.add_argument( - "--fastspeech2-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training fastspeech2." - ) - parser.add_argument( - "--melgan-config", type=str, help="parallel wavegan config file.") - parser.add_argument( - "--melgan-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--melgan-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training parallel wavegan." - ) - parser.add_argument( - "--phones-dict", - type=str, - default="phone_id_map.txt", - help="phone vocabulary file.") - parser.add_argument( - "--text", - type=str, - help="text to synthesize, a 'utt_id sentence' pair per line.") - parser.add_argument("--output-dir", type=str, help="output dir.") - parser.add_argument( - "--inference-dir", type=str, help="dir to save inference models") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose.") - - args = parser.parse_args() - - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.fastspeech2_config) as f: - fastspeech2_config = CfgNode(yaml.safe_load(f)) - with open(args.melgan_config) as f: - melgan_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(fastspeech2_config) - print(melgan_config) - - evaluate(args, fastspeech2_config, melgan_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/gan_vocoder/hifigan/__init__.py b/paddlespeech/t2s/exps/gan_vocoder/hifigan/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..185a92b8d94d3426d616c0624f0f2ee04339349e --- /dev/null +++ b/paddlespeech/t2s/exps/gan_vocoder/hifigan/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/t2s/exps/gan_vocoder/hifigan/train.py b/paddlespeech/t2s/exps/gan_vocoder/hifigan/train.py new file mode 100644 index 0000000000000000000000000000000000000000..f0e7708f329a19fcaa6e47eb2bc52dc1db728797 --- /dev/null +++ b/paddlespeech/t2s/exps/gan_vocoder/hifigan/train.py @@ -0,0 +1,277 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import logging +import os +import shutil +from pathlib import Path + +import jsonlines +import numpy as np +import paddle +import yaml +from paddle import DataParallel +from paddle import distributed as dist +from paddle import nn +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.optimizer import Adam +from paddle.optimizer.lr import MultiStepDecay +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.data_table import DataTable +from paddlespeech.t2s.datasets.vocoder_batch_fn import Clip +from paddlespeech.t2s.models.hifigan import HiFiGANEvaluator +from paddlespeech.t2s.models.hifigan import HiFiGANGenerator +from paddlespeech.t2s.models.hifigan import HiFiGANMultiScaleMultiPeriodDiscriminator +from paddlespeech.t2s.models.hifigan import HiFiGANUpdater +from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss +from paddlespeech.t2s.modules.losses import FeatureMatchLoss +from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss +from paddlespeech.t2s.modules.losses import MelSpectrogramLoss +from paddlespeech.t2s.training.extensions.snapshot import Snapshot +from paddlespeech.t2s.training.extensions.visualizer import VisualDL +from paddlespeech.t2s.training.seeding import seed_everything +from paddlespeech.t2s.training.trainer import Trainer + + +def train_sp(args, config): + # decides device type and whether to run in parallel + # setup running environment correctly + world_size = paddle.distributed.get_world_size() + if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0: + paddle.set_device("cpu") + else: + paddle.set_device("gpu") + if world_size > 1: + paddle.distributed.init_parallel_env() + + # set the random seed, it is a must for multiprocess training + seed_everything(config.seed) + + print( + f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}", + ) + + # dataloader has been too verbose + logging.getLogger("DataLoader").disabled = True + + # construct dataset for training and validation + with jsonlines.open(args.train_metadata, 'r') as reader: + train_metadata = list(reader) + train_dataset = DataTable( + data=train_metadata, + fields=["wave", "feats"], + converters={ + "wave": np.load, + "feats": np.load, + }, ) + with jsonlines.open(args.dev_metadata, 'r') as reader: + dev_metadata = list(reader) + dev_dataset = DataTable( + data=dev_metadata, + fields=["wave", "feats"], + converters={ + "wave": np.load, + "feats": np.load, + }, ) + + # collate function and dataloader + train_sampler = DistributedBatchSampler( + train_dataset, + batch_size=config.batch_size, + shuffle=True, + drop_last=True) + dev_sampler = DistributedBatchSampler( + dev_dataset, + batch_size=config.batch_size, + shuffle=False, + drop_last=False) + print("samplers done!") + + if "aux_context_window" in config.generator_params: + aux_context_window = config.generator_params.aux_context_window + else: + aux_context_window = 0 + train_batch_fn = Clip( + batch_max_steps=config.batch_max_steps, + hop_size=config.n_shift, + aux_context_window=aux_context_window) + + 
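+    # both the train and dev DataLoaders below reuse the Clip collate function
+    # created above (it crops each example to batch_max_steps audio samples)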
train_dataloader = DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=train_batch_fn, + num_workers=config.num_workers) + + dev_dataloader = DataLoader( + dev_dataset, + batch_sampler=dev_sampler, + collate_fn=train_batch_fn, + num_workers=config.num_workers) + print("dataloaders done!") + + generator = HiFiGANGenerator(**config["generator_params"]) + discriminator = HiFiGANMultiScaleMultiPeriodDiscriminator( + **config["discriminator_params"]) + if world_size > 1: + generator = DataParallel(generator) + discriminator = DataParallel(discriminator) + print("models done!") + + criterion_feat_match = FeatureMatchLoss(**config["feat_match_loss_params"]) + criterion_mel = MelSpectrogramLoss( + fs=config.fs, + fft_size=config.n_fft, + hop_size=config.n_shift, + win_length=config.win_length, + window=config.window, + num_mels=config.n_mels, + fmin=config.fmin, + fmax=config.fmax, ) + criterion_gen_adv = GeneratorAdversarialLoss( + **config["generator_adv_loss_params"]) + criterion_dis_adv = DiscriminatorAdversarialLoss( + **config["discriminator_adv_loss_params"]) + print("criterions done!") + + lr_schedule_g = MultiStepDecay(**config["generator_scheduler_params"]) + # Compared to multi_band_melgan.v1 config, Adam optimizer without gradient norm is used + generator_grad_norm = config["generator_grad_norm"] + gradient_clip_g = nn.ClipGradByGlobalNorm( + generator_grad_norm) if generator_grad_norm > 0 else None + print("gradient_clip_g:", gradient_clip_g) + + optimizer_g = Adam( + learning_rate=lr_schedule_g, + grad_clip=gradient_clip_g, + parameters=generator.parameters(), + **config["generator_optimizer_params"]) + lr_schedule_d = MultiStepDecay(**config["discriminator_scheduler_params"]) + discriminator_grad_norm = config["discriminator_grad_norm"] + gradient_clip_d = nn.ClipGradByGlobalNorm( + discriminator_grad_norm) if discriminator_grad_norm > 0 else None + print("gradient_clip_d:", gradient_clip_d) + optimizer_d = Adam( + learning_rate=lr_schedule_d, + grad_clip=gradient_clip_d, + parameters=discriminator.parameters(), + **config["discriminator_optimizer_params"]) + print("optimizers done!") + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + if dist.get_rank() == 0: + config_name = args.config.split("/")[-1] + # copy conf to output_dir + shutil.copyfile(args.config, output_dir / config_name) + + updater = HiFiGANUpdater( + models={ + "generator": generator, + "discriminator": discriminator, + }, + optimizers={ + "generator": optimizer_g, + "discriminator": optimizer_d, + }, + criterions={ + "mel": criterion_mel, + "feat_match": criterion_feat_match, + "gen_adv": criterion_gen_adv, + "dis_adv": criterion_dis_adv, + }, + schedulers={ + "generator": lr_schedule_g, + "discriminator": lr_schedule_d, + }, + dataloader=train_dataloader, + discriminator_train_start_steps=config.discriminator_train_start_steps, + # only hifigan have generator_train_start_steps + generator_train_start_steps=config.generator_train_start_steps, + lambda_adv=config.lambda_adv, + lambda_aux=config.lambda_aux, + lambda_feat_match=config.lambda_feat_match, + output_dir=output_dir) + + evaluator = HiFiGANEvaluator( + models={ + "generator": generator, + "discriminator": discriminator, + }, + criterions={ + "mel": criterion_mel, + "feat_match": criterion_feat_match, + "gen_adv": criterion_gen_adv, + "dis_adv": criterion_dis_adv, + }, + dataloader=dev_dataloader, + lambda_adv=config.lambda_adv, + lambda_aux=config.lambda_aux, + 
lambda_feat_match=config.lambda_feat_match, + output_dir=output_dir) + + trainer = Trainer( + updater, + stop_trigger=(config.train_max_steps, "iteration"), + out=output_dir) + + if dist.get_rank() == 0: + trainer.extend( + evaluator, trigger=(config.eval_interval_steps, 'iteration')) + trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration')) + trainer.extend( + Snapshot(max_size=config.num_snapshots), + trigger=(config.save_interval_steps, 'iteration')) + + print("Trainer Done!") + trainer.run() + + +def main(): + # parse args and config and redirect to train_sp + + parser = argparse.ArgumentParser( + description="Train a HiFiGAN model.") + parser.add_argument( + "--config", type=str, help="config file to overwrite default config.") + parser.add_argument("--train-metadata", type=str, help="training data.") + parser.add_argument("--dev-metadata", type=str, help="dev data.") + parser.add_argument("--output-dir", type=str, help="output dir.") + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") + parser.add_argument("--verbose", type=int, default=1, help="verbose.") + + args = parser.parse_args() + + with open(args.config, 'rt') as f: + config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(config) + print( + f"master see the word size: {dist.get_world_size()}, from pid: {os.getpid()}" + ) + + # dispatch + if args.ngpu > 1: + dist.spawn(train_sp, (args, config), nprocs=args.ngpu) + else: + train_sp(args, config) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/gan_vocoder/synthesize.py b/paddlespeech/t2s/exps/gan_vocoder/synthesize.py index d7fd2f9441abbce55b71998b7715650ead978851..6f4dc92dbc9a74f4735880e5ce297a898e3e1014 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/synthesize.py +++ b/paddlespeech/t2s/exps/gan_vocoder/synthesize.py @@ -84,6 +84,7 @@ def main(): generator.set_state_dict(state_dict["generator_params"]) generator.remove_weight_norm() generator.eval() + with jsonlines.open(args.test_metadata, 'r') as reader: metadata = list(reader) test_dataset = DataTable( diff --git a/paddlespeech/t2s/exps/inference.py b/paddlespeech/t2s/exps/inference.py new file mode 100644 index 0000000000000000000000000000000000000000..e1d5306c25aa762f543b397809f6f6be9a781888 --- /dev/null +++ b/paddlespeech/t2s/exps/inference.py @@ -0,0 +1,136 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse
+from pathlib import Path
+
+import soundfile as sf
+from paddle import inference
+
+from paddlespeech.t2s.frontend.zh_frontend import Frontend
+
+
+# only inference for models trained with csmsc now
+def main():
+    parser = argparse.ArgumentParser(
+        description="Paddle Inference with speedyspeech & parallel wavegan.")
+    # acoustic model
+    parser.add_argument(
+        '--am',
+        type=str,
+        default='fastspeech2_csmsc',
+        choices=['speedyspeech_csmsc', 'fastspeech2_csmsc'],
+        help='Choose acoustic model type of tts task.')
+    parser.add_argument(
+        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
+    parser.add_argument(
+        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
+    # voc
+    parser.add_argument(
+        '--voc',
+        type=str,
+        default='pwgan_csmsc',
+        choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc'],
+        help='Choose vocoder type of tts task.')
+    # other
+    parser.add_argument(
+        "--text",
+        type=str,
+        help="text to synthesize, a 'utt_id sentence' pair per line")
+    parser.add_argument(
+        "--inference_dir", type=str, help="dir to save inference models")
+    parser.add_argument("--output_dir", type=str, help="output dir")
+
+    args, _ = parser.parse_known_args()
+
+    frontend = Frontend(
+        phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
+    print("frontend done!")
+
+    # model: {model_name}_{dataset}
+    am_name = args.am[:args.am.rindex('_')]
+    am_dataset = args.am[args.am.rindex('_') + 1:]
+
+    am_config = inference.Config(
+        str(Path(args.inference_dir) / (args.am + ".pdmodel")),
+        str(Path(args.inference_dir) / (args.am + ".pdiparams")))
+    am_config.enable_use_gpu(100, 0)
+    # This line must be commented for fastspeech2, if not, it will OOM
+    if am_name != 'fastspeech2':
+        am_config.enable_memory_optim()
+    am_predictor = inference.create_predictor(am_config)
+
+    voc_config = inference.Config(
+        str(Path(args.inference_dir) / (args.voc + ".pdmodel")),
+        str(Path(args.inference_dir) / (args.voc + ".pdiparams")))
+    voc_config.enable_use_gpu(100, 0)
+    voc_config.enable_memory_optim()
+    voc_predictor = inference.create_predictor(voc_config)
+
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    sentences = []
+
+    print("in new inference")
+
+    with open(args.text, 'rt') as f:
+        for line in f:
+            items = line.strip().split()
+            utt_id = items[0]
+            sentence = "".join(items[1:])
+            sentences.append((utt_id, sentence))
+
+    get_tone_ids = False
+    if am_name == 'speedyspeech':
+        get_tone_ids = True
+
+    am_input_names = am_predictor.get_input_names()
+
+    for utt_id, sentence in sentences:
+        input_ids = frontend.get_input_ids(
+            sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
+        phone_ids = input_ids["phone_ids"]
+        if get_tone_ids:
+            tone_ids = input_ids["tone_ids"]
+            tones = tone_ids[0].numpy()
+            tones_handle = am_predictor.get_input_handle(am_input_names[1])
+            tones_handle.reshape(tones.shape)
+            tones_handle.copy_from_cpu(tones)
+
+        phones = phone_ids[0].numpy()
+        phones_handle = am_predictor.get_input_handle(am_input_names[0])
+        phones_handle.reshape(phones.shape)
+        phones_handle.copy_from_cpu(phones)
+
+        am_predictor.run()
+        am_output_names = am_predictor.get_output_names()
+        am_output_handle = am_predictor.get_output_handle(am_output_names[0])
+        am_output_data = am_output_handle.copy_to_cpu()
+
+        voc_input_names = voc_predictor.get_input_names()
+        mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
+        mel_handle.reshape(am_output_data.shape)
+        
mel_handle.copy_from_cpu(am_output_data) + + voc_predictor.run() + voc_output_names = voc_predictor.get_output_names() + voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0]) + wav = voc_output_handle.copy_to_cpu() + + sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000) + + print(f"{utt_id} done!") + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/speedyspeech/inference.py b/paddlespeech/t2s/exps/speedyspeech/inference.py index 0ed2e0bf104e30301ecb20fceace28f9ef26bcad..d4958bc49a8494dd4b50cf373ef422e61f57852e 100644 --- a/paddlespeech/t2s/exps/speedyspeech/inference.py +++ b/paddlespeech/t2s/exps/speedyspeech/inference.py @@ -11,8 +11,8 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +# remain for chains import argparse -import os from pathlib import Path import soundfile as sf @@ -31,8 +31,6 @@ def main(): type=str, help="text to synthesize, a 'utt_id sentence' pair per line") parser.add_argument("--output-dir", type=str, help="output dir") - parser.add_argument( - "--enable-auto-log", action="store_true", help="use auto log") parser.add_argument( "--phones-dict", type=str, @@ -64,23 +62,6 @@ def main(): pwg_config.enable_memory_optim() pwg_predictor = inference.create_predictor(pwg_config) - if args.enable_auto_log: - import auto_log - os.makedirs("output", exist_ok=True) - pid = os.getpid() - logger = auto_log.AutoLogger( - model_name="speedyspeech", - model_precision='float32', - batch_size=1, - data_shape="dynamic", - save_path="./output/auto_log.log", - inference_config=speedyspeech_config, - pids=pid, - process_name=None, - gpu_ids=0, - time_keys=['preprocess_time', 'inference_time', 'postprocess_time'], - warmup=0) - output_dir = Path(args.output_dir) output_dir.mkdir(parents=True, exist_ok=True) sentences = [] @@ -93,9 +74,6 @@ def main(): sentences.append((utt_id, sentence)) for utt_id, sentence in sentences: - if args.enable_auto_log: - logger.times.start() - input_ids = frontend.get_input_ids( sentence, merge_sentences=True, get_tone_ids=True) phone_ids = input_ids["phone_ids"] @@ -103,9 +81,6 @@ def main(): phones = phone_ids[0].numpy() tones = tone_ids[0].numpy() - if args.enable_auto_log: - logger.times.stamp() - input_names = speedyspeech_predictor.get_input_names() phones_handle = speedyspeech_predictor.get_input_handle(input_names[0]) tones_handle = speedyspeech_predictor.get_input_handle(input_names[1]) @@ -131,18 +106,10 @@ def main(): output_handle = pwg_predictor.get_output_handle(output_names[0]) wav = output_data = output_handle.copy_to_cpu() - if args.enable_auto_log: - logger.times.stamp() - sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000) - if args.enable_auto_log: - logger.times.end(stamp=True) print(f"{utt_id} done!") - if args.enable_auto_log: - logger.report() - if __name__ == "__main__": main() diff --git a/paddlespeech/t2s/exps/speedyspeech/synthesize.py b/paddlespeech/t2s/exps/speedyspeech/synthesize.py deleted file mode 100644 index 67d56ea54d9997496790f686eda78f332a73be37..0000000000000000000000000000000000000000 --- a/paddlespeech/t2s/exps/speedyspeech/synthesize.py +++ /dev/null @@ -1,185 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import argparse -import logging -import os -from pathlib import Path - -import jsonlines -import numpy as np -import paddle -import soundfile as sf -import yaml -from paddle import jit -from paddle.static import InputSpec -from yacs.config import CfgNode - -from paddlespeech.t2s.datasets.data_table import DataTable -from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator -from paddlespeech.t2s.models.parallel_wavegan import PWGInference -from paddlespeech.t2s.models.speedyspeech import SpeedySpeech -from paddlespeech.t2s.models.speedyspeech import SpeedySpeechInference -from paddlespeech.t2s.modules.normalizer import ZScore - - -def evaluate(args, speedyspeech_config, pwg_config): - # dataloader has been too verbose - logging.getLogger("DataLoader").disabled = True - - # construct dataset for evaluation - with jsonlines.open(args.test_metadata, 'r') as reader: - test_metadata = list(reader) - test_dataset = DataTable( - data=test_metadata, fields=["utt_id", "phones", "tones"]) - - with open(args.phones_dict, "r") as f: - phn_id = [line.strip().split() for line in f.readlines()] - vocab_size = len(phn_id) - print("vocab_size:", vocab_size) - with open(args.tones_dict, "r") as f: - tone_id = [line.strip().split() for line in f.readlines()] - tone_size = len(tone_id) - print("tone_size:", tone_size) - - model = SpeedySpeech( - vocab_size=vocab_size, - tone_size=tone_size, - **speedyspeech_config["model"]) - model.set_state_dict( - paddle.load(args.speedyspeech_checkpoint)["main_params"]) - model.eval() - - vocoder = PWGGenerator(**pwg_config["generator_params"]) - vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"]) - vocoder.remove_weight_norm() - vocoder.eval() - print("model done!") - - stat = np.load(args.speedyspeech_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - speedyspeech_normalizer = ZScore(mu, std) - speedyspeech_normalizer.eval() - - stat = np.load(args.pwg_stat) - mu, std = stat - mu = paddle.to_tensor(mu) - std = paddle.to_tensor(std) - pwg_normalizer = ZScore(mu, std) - pwg_normalizer.eval() - - speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer, - model) - speedyspeech_inference.eval() - speedyspeech_inference = jit.to_static( - speedyspeech_inference, - input_spec=[ - InputSpec([-1], dtype=paddle.int64), InputSpec( - [-1], dtype=paddle.int64) - ]) - paddle.jit.save(speedyspeech_inference, - os.path.join(args.inference_dir, "speedyspeech")) - speedyspeech_inference = paddle.jit.load( - os.path.join(args.inference_dir, "speedyspeech")) - - pwg_inference = PWGInference(pwg_normalizer, vocoder) - pwg_inference.eval() - pwg_inference = jit.to_static( - pwg_inference, input_spec=[ - InputSpec([-1, 80], dtype=paddle.float32), - ]) - paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg")) - pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg")) - - output_dir = Path(args.output_dir) - output_dir.mkdir(parents=True, exist_ok=True) - - for datum in test_dataset: - utt_id = datum["utt_id"] - phones = paddle.to_tensor(datum["phones"]) - tones = 
paddle.to_tensor(datum["tones"]) - - with paddle.no_grad(): - wav = pwg_inference(speedyspeech_inference(phones, tones)) - sf.write( - output_dir / (utt_id + ".wav"), - wav.numpy(), - samplerate=speedyspeech_config.fs) - print(f"{utt_id} done!") - - -def main(): - # parse args and config and redirect to train_sp - parser = argparse.ArgumentParser( - description="Synthesize with speedyspeech & parallel wavegan.") - parser.add_argument( - "--speedyspeech-config", type=str, help="config file for speedyspeech.") - parser.add_argument( - "--speedyspeech-checkpoint", - type=str, - help="speedyspeech checkpoint to load.") - parser.add_argument( - "--speedyspeech-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training speedyspeech." - ) - parser.add_argument( - "--pwg-config", type=str, help="config file for parallelwavegan.") - parser.add_argument( - "--pwg-checkpoint", - type=str, - help="parallel wavegan generator parameters to load.") - parser.add_argument( - "--pwg-stat", - type=str, - help="mean and standard deviation used to normalize spectrogram when training speedyspeech." - ) - parser.add_argument( - "--phones-dict", type=str, default=None, help="phone vocabulary file.") - parser.add_argument( - "--tones-dict", type=str, default=None, help="tone vocabulary file.") - parser.add_argument("--test-metadata", type=str, help="test metadata") - parser.add_argument("--output-dir", type=str, help="output dir") - parser.add_argument( - "--inference-dir", type=str, help="dir to save inference models") - parser.add_argument( - "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") - parser.add_argument("--verbose", type=int, default=1, help="verbose") - - args, _ = parser.parse_known_args() - - if args.ngpu == 0: - paddle.set_device("cpu") - elif args.ngpu > 0: - paddle.set_device("gpu") - else: - print("ngpu should >= 0 !") - - with open(args.speedyspeech_config) as f: - speedyspeech_config = CfgNode(yaml.safe_load(f)) - with open(args.pwg_config) as f: - pwg_config = CfgNode(yaml.safe_load(f)) - - print("========Args========") - print(yaml.safe_dump(vars(args))) - print("========Config========") - print(speedyspeech_config) - print(pwg_config) - - evaluate(args, speedyspeech_config, pwg_config) - - -if __name__ == "__main__": - main() diff --git a/paddlespeech/t2s/exps/speedyspeech/synthesize_e2e.py b/paddlespeech/t2s/exps/speedyspeech/synthesize_e2e.py index 403d35088276d6ac2270185149f361a5fa1de8ee..2854d0555ad3041de9a6c4d35b6fd1c673a5042f 100644 --- a/paddlespeech/t2s/exps/speedyspeech/synthesize_e2e.py +++ b/paddlespeech/t2s/exps/speedyspeech/synthesize_e2e.py @@ -11,6 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+# remain for chains import argparse import logging import os diff --git a/paddlespeech/t2s/exps/speedyspeech/train.py b/paddlespeech/t2s/exps/speedyspeech/train.py index d9a2fbf44a6d92fc481a93d4d7fe775964071cb3..001e22aea6d608566f1c6a8c16f84eb6abb68bf4 100644 --- a/paddlespeech/t2s/exps/speedyspeech/train.py +++ b/paddlespeech/t2s/exps/speedyspeech/train.py @@ -161,7 +161,7 @@ def train_sp(args, config): def main(): # parse args and config and redirect to train_sp parser = argparse.ArgumentParser( - description="Train a Speedyspeech model with sigle speaker dataset.") + description="Train a Speedyspeech model with a single speaker dataset.") parser.add_argument("--config", type=str, help="config file.") parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--dev-metadata", type=str, help="dev data.") diff --git a/paddlespeech/t2s/exps/synthesize.py b/paddlespeech/t2s/exps/synthesize.py new file mode 100644 index 0000000000000000000000000000000000000000..f54774704a86d34cf00a8d01ac827ad4bfc84d80 --- /dev/null +++ b/paddlespeech/t2s/exps/synthesize.py @@ -0,0 +1,268 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import logging +from pathlib import Path + +import jsonlines +import numpy as np +import paddle +import soundfile as sf +import yaml +from yacs.config import CfgNode + +from paddlespeech.s2t.utils.dynamic_import import dynamic_import +from paddlespeech.t2s.datasets.data_table import DataTable +from paddlespeech.t2s.modules.normalizer import ZScore + +model_alias = { + # acoustic model + "speedyspeech": + "paddlespeech.t2s.models.speedyspeech:SpeedySpeech", + "speedyspeech_inference": + "paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference", + "fastspeech2": + "paddlespeech.t2s.models.fastspeech2:FastSpeech2", + "fastspeech2_inference": + "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference", + # voc + "pwgan": + "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator", + "pwgan_inference": + "paddlespeech.t2s.models.parallel_wavegan:PWGInference", + "mb_melgan": + "paddlespeech.t2s.models.melgan:MelGANGenerator", + "mb_melgan_inference": + "paddlespeech.t2s.models.melgan:MelGANInference", +} + + +def evaluate(args): + # dataloader has been too verbose + logging.getLogger("DataLoader").disabled = True + + # construct dataset for evaluation + with jsonlines.open(args.test_metadata, 'r') as reader: + test_metadata = list(reader) + + # Init body. 
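+    # load the yaml configs of the selected acoustic model and vocoder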
+ with open(args.am_config) as f: + am_config = CfgNode(yaml.safe_load(f)) + with open(args.voc_config) as f: + voc_config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(am_config) + print(voc_config) + + # construct dataset for evaluation + + # model: {model_name}_{dataset} + am_name = args.am[:args.am.rindex('_')] + am_dataset = args.am[args.am.rindex('_') + 1:] + + if am_name == 'fastspeech2': + fields = ["utt_id", "text"] + spk_num = None + if am_dataset in {"aishell3", "vctk"} and args.speaker_dict: + print("multiple speaker fastspeech2!") + with open(args.speaker_dict, 'rt') as f: + spk_id = [line.strip().split() for line in f.readlines()] + spk_num = len(spk_id) + fields += ["spk_id"] + elif args.voice_cloning: + print("voice cloning!") + fields += ["spk_emb"] + else: + print("single speaker fastspeech2!") + print("spk_num:", spk_num) + elif am_name == 'speedyspeech': + fields = ["utt_id", "phones", "tones"] + + test_dataset = DataTable(data=test_metadata, fields=fields) + + with open(args.phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + vocab_size = len(phn_id) + print("vocab_size:", vocab_size) + + tone_size = None + if args.tones_dict: + with open(args.tones_dict, "r") as f: + tone_id = [line.strip().split() for line in f.readlines()] + tone_size = len(tone_id) + print("tone_size:", tone_size) + + # acoustic model + odim = am_config.n_mels + am_class = dynamic_import(am_name, model_alias) + am_inference_class = dynamic_import(am_name + '_inference', model_alias) + + if am_name == 'fastspeech2': + am = am_class( + idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"]) + elif am_name == 'speedyspeech': + am = am_class( + vocab_size=vocab_size, tone_size=tone_size, **am_config["model"]) + + am.set_state_dict(paddle.load(args.am_ckpt)["main_params"]) + am.eval() + am_mu, am_std = np.load(args.am_stat) + am_mu = paddle.to_tensor(am_mu) + am_std = paddle.to_tensor(am_std) + am_normalizer = ZScore(am_mu, am_std) + am_inference = am_inference_class(am_normalizer, am) + print("am_inference.training0:", am_inference.training) + am_inference.eval() + print("acoustic model done!") + + # vocoder + # model: {model_name}_{dataset} + voc_name = args.voc[:args.voc.rindex('_')] + voc_class = dynamic_import(voc_name, model_alias) + voc_inference_class = dynamic_import(voc_name + '_inference', model_alias) + voc = voc_class(**voc_config["generator_params"]) + voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"]) + voc.remove_weight_norm() + voc.eval() + voc_mu, voc_std = np.load(args.voc_stat) + voc_mu = paddle.to_tensor(voc_mu) + voc_std = paddle.to_tensor(voc_std) + voc_normalizer = ZScore(voc_mu, voc_std) + voc_inference = voc_inference_class(voc_normalizer, voc) + print("voc_inference.training0:", voc_inference.training) + voc_inference.eval() + print("voc done!") + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + for datum in test_dataset: + utt_id = datum["utt_id"] + with paddle.no_grad(): + # acoustic model + if am_name == 'fastspeech2': + phone_ids = paddle.to_tensor(datum["text"]) + spk_emb = None + spk_id = None + # multi speaker + if args.voice_cloning and "spk_emb" in datum: + spk_emb = paddle.to_tensor(np.load(datum["spk_emb"])) + elif "spk_id" in datum: + spk_id = paddle.to_tensor(datum["spk_id"]) + mel = am_inference(phone_ids, spk_id=spk_id, spk_emb=spk_emb) + elif am_name == 'speedyspeech': 
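+                # speedyspeech consumes tone ids in addition to phone ids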
+ phone_ids = paddle.to_tensor(datum["phones"]) + tone_ids = paddle.to_tensor(datum["tones"]) + mel = am_inference(phone_ids, tone_ids) + # vocoder + wav = voc_inference(mel) + sf.write( + str(output_dir / (utt_id + ".wav")), + wav.numpy(), + samplerate=am_config.fs) + print(f"{utt_id} done!") + + +def main(): + # parse args and config and redirect to train_sp + parser = argparse.ArgumentParser( + description="Synthesize with acoustic model & vocoder") + # acoustic model + parser.add_argument( + '--am', + type=str, + default='fastspeech2_csmsc', + choices=[ + 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech', + 'fastspeech2_aishell3', 'fastspeech2_vctk' + ], + help='Choose acoustic model type of tts task.') + parser.add_argument( + '--am_config', + type=str, + default=None, + help='Config of acoustic model. Use deault config when it is None.') + parser.add_argument( + '--am_ckpt', + type=str, + default=None, + help='Checkpoint file of acoustic model.') + parser.add_argument( + "--am_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training acoustic model." + ) + parser.add_argument( + "--phones_dict", type=str, default=None, help="phone vocabulary file.") + parser.add_argument( + "--tones_dict", type=str, default=None, help="tone vocabulary file.") + parser.add_argument( + "--speaker_dict", type=str, default=None, help="speaker id map file.") + + def str2bool(str): + return True if str.lower() == 'true' else False + + parser.add_argument( + "--voice-cloning", + type=str2bool, + default=False, + help="whether training voice cloning model.") + # vocoder + parser.add_argument( + '--voc', + type=str, + default='pwgan_csmsc', + choices=[ + 'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk', + 'mb_melgan_csmsc' + ], + help='Choose vocoder type of tts task.') + + parser.add_argument( + '--voc_config', + type=str, + default=None, + help='Config of voc. Use deault config when it is None.') + parser.add_argument( + '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.') + parser.add_argument( + "--voc_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training voc." + ) + # other + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") + parser.add_argument("--test_metadata", type=str, help="test metadata.") + parser.add_argument("--output_dir", type=str, help="output dir.") + + args = parser.parse_args() + + if args.ngpu == 0: + paddle.set_device("cpu") + elif args.ngpu > 0: + paddle.set_device("gpu") + else: + print("ngpu should >= 0 !") + + evaluate(args) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/synthesize_e2e.py b/paddlespeech/t2s/exps/synthesize_e2e.py new file mode 100644 index 0000000000000000000000000000000000000000..9a83ec1bc22caf9d12a64b66510bc89854d8089f --- /dev/null +++ b/paddlespeech/t2s/exps/synthesize_e2e.py @@ -0,0 +1,336 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from pathlib import Path + +import numpy as np +import paddle +import soundfile as sf +import yaml +from paddle import jit +from paddle.static import InputSpec +from yacs.config import CfgNode + +from paddlespeech.s2t.utils.dynamic_import import dynamic_import +from paddlespeech.t2s.frontend import English +from paddlespeech.t2s.frontend.zh_frontend import Frontend +from paddlespeech.t2s.modules.normalizer import ZScore + +model_alias = { + # acoustic model + "speedyspeech": + "paddlespeech.t2s.models.speedyspeech:SpeedySpeech", + "speedyspeech_inference": + "paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference", + "fastspeech2": + "paddlespeech.t2s.models.fastspeech2:FastSpeech2", + "fastspeech2_inference": + "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference", + # voc + "pwgan": + "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator", + "pwgan_inference": + "paddlespeech.t2s.models.parallel_wavegan:PWGInference", + "mb_melgan": + "paddlespeech.t2s.models.melgan:MelGANGenerator", + "mb_melgan_inference": + "paddlespeech.t2s.models.melgan:MelGANInference", + "style_melgan": + "paddlespeech.t2s.models.melgan:StyleMelGANGenerator", + "style_melgan_inference": + "paddlespeech.t2s.models.melgan:StyleMelGANInference", + "hifigan": + "paddlespeech.t2s.models.hifigan:HiFiGANGenerator", + "hifigan_inference": + "paddlespeech.t2s.models.hifigan:HiFiGANInference", +} + + +def evaluate(args): + + # Init body. + with open(args.am_config) as f: + am_config = CfgNode(yaml.safe_load(f)) + with open(args.voc_config) as f: + voc_config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(am_config) + print(voc_config) + + # construct dataset for evaluation + sentences = [] + with open(args.text, 'rt') as f: + for line in f: + items = line.strip().split() + utt_id = items[0] + if args.lang == 'zh': + sentence = "".join(items[1:]) + elif args.lang == 'en': + sentence = " ".join(items[1:]) + sentences.append((utt_id, sentence)) + + with open(args.phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + vocab_size = len(phn_id) + print("vocab_size:", vocab_size) + + tone_size = None + if args.tones_dict: + with open(args.tones_dict, "r") as f: + tone_id = [line.strip().split() for line in f.readlines()] + tone_size = len(tone_id) + print("tone_size:", tone_size) + + spk_num = None + if args.speaker_dict: + with open(args.speaker_dict, 'rt') as f: + spk_id = [line.strip().split() for line in f.readlines()] + spk_num = len(spk_id) + print("spk_num:", spk_num) + + # frontend + if args.lang == 'zh': + frontend = Frontend( + phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict) + elif args.lang == 'en': + frontend = English(phone_vocab_path=args.phones_dict) + print("frontend done!") + + # acoustic model + odim = am_config.n_mels + # model: {model_name}_{dataset} + am_name = args.am[:args.am.rindex('_')] + am_dataset = args.am[args.am.rindex('_') + 1:] + + am_class = dynamic_import(am_name, model_alias) + am_inference_class = dynamic_import(am_name + '_inference', model_alias) + + if am_name == 'fastspeech2': + am = am_class( + idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"]) + elif am_name == 'speedyspeech': + am = am_class( + vocab_size=vocab_size, tone_size=tone_size, **am_config["model"]) + + 
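+    # restore trained acoustic model weights and wrap the model with its
+    # ZScore normalizer to build an inference-only module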
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"]) + am.eval() + am_mu, am_std = np.load(args.am_stat) + am_mu = paddle.to_tensor(am_mu) + am_std = paddle.to_tensor(am_std) + am_normalizer = ZScore(am_mu, am_std) + am_inference = am_inference_class(am_normalizer, am) + am_inference.eval() + print("acoustic model done!") + + # vocoder + # model: {model_name}_{dataset} + voc_name = args.voc[:args.voc.rindex('_')] + voc_class = dynamic_import(voc_name, model_alias) + voc_inference_class = dynamic_import(voc_name + '_inference', model_alias) + voc = voc_class(**voc_config["generator_params"]) + voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"]) + voc.remove_weight_norm() + voc.eval() + voc_mu, voc_std = np.load(args.voc_stat) + voc_mu = paddle.to_tensor(voc_mu) + voc_std = paddle.to_tensor(voc_std) + voc_normalizer = ZScore(voc_mu, voc_std) + voc_inference = voc_inference_class(voc_normalizer, voc) + voc_inference.eval() + print("voc done!") + + # whether dygraph to static + if args.inference_dir: + # acoustic model + if am_name == 'fastspeech2': + if am_dataset in {"aishell3", "vctk"} and args.speaker_dict: + print( + "Haven't test dygraph to static for multi speaker fastspeech2 now!" + ) + else: + am_inference = jit.to_static( + am_inference, + input_spec=[InputSpec([-1], dtype=paddle.int64)]) + paddle.jit.save(am_inference, + os.path.join(args.inference_dir, args.am)) + am_inference = paddle.jit.load( + os.path.join(args.inference_dir, args.am)) + elif am_name == 'speedyspeech': + am_inference = jit.to_static( + am_inference, + input_spec=[ + InputSpec([-1], dtype=paddle.int64), + InputSpec([-1], dtype=paddle.int64) + ]) + + paddle.jit.save(am_inference, + os.path.join(args.inference_dir, args.am)) + am_inference = paddle.jit.load( + os.path.join(args.inference_dir, args.am)) + + # vocoder + voc_inference = jit.to_static( + voc_inference, + input_spec=[ + InputSpec([-1, 80], dtype=paddle.float32), + ]) + paddle.jit.save(voc_inference, + os.path.join(args.inference_dir, args.voc)) + voc_inference = paddle.jit.load( + os.path.join(args.inference_dir, args.voc)) + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + for utt_id, sentence in sentences: + get_tone_ids = False + if am_name == 'speedyspeech': + get_tone_ids = True + if args.lang == 'zh': + input_ids = frontend.get_input_ids( + sentence, merge_sentences=True, get_tone_ids=get_tone_ids) + phone_ids = input_ids["phone_ids"] + phone_ids = phone_ids[0] + if get_tone_ids: + tone_ids = input_ids["tone_ids"] + tone_ids = tone_ids[0] + elif args.lang == 'en': + input_ids = frontend.get_input_ids(sentence) + phone_ids = input_ids["phone_ids"] + else: + print("lang should in {'zh', 'en'}!") + + with paddle.no_grad(): + # acoustic model + if am_name == 'fastspeech2': + # multi speaker + if am_dataset in {"aishell3", "vctk"}: + spk_id = paddle.to_tensor(args.spk_id) + mel = am_inference(phone_ids, spk_id) + else: + mel = am_inference(phone_ids) + elif am_name == 'speedyspeech': + mel = am_inference(phone_ids, tone_ids) + # vocoder + wav = voc_inference(mel) + sf.write( + str(output_dir / (utt_id + ".wav")), + wav.numpy(), + samplerate=am_config.fs) + print(f"{utt_id} done!") + + +def main(): + # parse args and config and redirect to train_sp + parser = argparse.ArgumentParser( + description="Synthesize with acoustic model & vocoder") + # acoustic model + parser.add_argument( + '--am', + type=str, + default='fastspeech2_csmsc', + choices=[ + 'speedyspeech_csmsc', 'fastspeech2_csmsc', 
'fastspeech2_ljspeech', + 'fastspeech2_aishell3', 'fastspeech2_vctk' + ], + help='Choose acoustic model type of tts task.') + parser.add_argument( + '--am_config', + type=str, + default=None, + help='Config of acoustic model. Use deault config when it is None.') + parser.add_argument( + '--am_ckpt', + type=str, + default=None, + help='Checkpoint file of acoustic model.') + parser.add_argument( + "--am_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training acoustic model." + ) + parser.add_argument( + "--phones_dict", type=str, default=None, help="phone vocabulary file.") + parser.add_argument( + "--tones_dict", type=str, default=None, help="tone vocabulary file.") + parser.add_argument( + "--speaker_dict", type=str, default=None, help="speaker id map file.") + parser.add_argument( + '--spk_id', + type=int, + default=0, + help='spk id for multi speaker acoustic model') + # vocoder + parser.add_argument( + '--voc', + type=str, + default='pwgan_csmsc', + choices=[ + 'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk', + 'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc' + ], + help='Choose vocoder type of tts task.') + + parser.add_argument( + '--voc_config', + type=str, + default=None, + help='Config of voc. Use deault config when it is None.') + parser.add_argument( + '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.') + parser.add_argument( + "--voc_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training voc." + ) + # other + parser.add_argument( + '--lang', + type=str, + default='zh', + help='Choose model language. zh or en') + + parser.add_argument( + "--inference_dir", + type=str, + default=None, + help="dir to save inference models") + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") + parser.add_argument( + "--text", + type=str, + help="text to synthesize, a 'utt_id sentence' pair per line.") + parser.add_argument("--output_dir", type=str, help="output dir.") + + args = parser.parse_args() + + if args.ngpu == 0: + paddle.set_device("cpu") + elif args.ngpu > 0: + paddle.set_device("gpu") + else: + print("ngpu should >= 0 !") + + evaluate(args) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/models/__init__.py b/paddlespeech/t2s/models/__init__.py index 601bd9d6845d09ab525a5252b24db6c4d311707d..f268a4e3359ecbee3a3b478b7cb94c31b145487e 100644 --- a/paddlespeech/t2s/models/__init__.py +++ b/paddlespeech/t2s/models/__init__.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. from .fastspeech2 import * +from .hifigan import * from .melgan import * from .parallel_wavegan import * from .speedyspeech import * diff --git a/paddlespeech/t2s/models/hifigan/__init__.py b/paddlespeech/t2s/models/hifigan/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..7aa5e9d780d252c43c0e180278c4906023b73a77 --- /dev/null +++ b/paddlespeech/t2s/models/hifigan/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .hifigan import * +from .hifigan_updater import * diff --git a/paddlespeech/t2s/models/hifigan/hifigan.py b/paddlespeech/t2s/models/hifigan/hifigan.py new file mode 100644 index 0000000000000000000000000000000000000000..82dd66c175751533588f8f4b2301c9fd3c183059 --- /dev/null +++ b/paddlespeech/t2s/models/hifigan/hifigan.py @@ -0,0 +1,779 @@ +# -*- coding: utf-8 -*- +"""HiFi-GAN Modules. +This code is based on https://github.com/jik876/hifi-gan. +""" +import copy +from typing import Any +from typing import Dict +from typing import List + +import paddle +import paddle.nn.functional as F +from paddle import nn + +from paddlespeech.t2s.modules.activation import get_activation +from paddlespeech.t2s.modules.nets_utils import initialize +from paddlespeech.t2s.modules.residual_block import HiFiGANResidualBlock as ResidualBlock + + +class HiFiGANGenerator(nn.Layer): + """HiFiGAN generator module.""" + + def __init__( + self, + in_channels: int=80, + out_channels: int=1, + channels: int=512, + kernel_size: int=7, + upsample_scales: List[int]=(8, 8, 2, 2), + upsample_kernel_sizes: List[int]=(16, 16, 4, 4), + resblock_kernel_sizes: List[int]=(3, 7, 11), + resblock_dilations: List[List[int]]=[(1, 3, 5), (1, 3, 5), + (1, 3, 5)], + use_additional_convs: bool=True, + bias: bool=True, + nonlinear_activation: str="leakyrelu", + nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1}, + use_weight_norm: bool=True, + init_type: str="xavier_uniform", ): + """Initialize HiFiGANGenerator module. + Parameters + ---------- + in_channels : int + Number of input channels. + out_channels : int + Number of output channels. + channels : int + Number of hidden representation channels. + kernel_size : int + Kernel size of initial and final conv layer. + upsample_scales : list + List of upsampling scales. + upsample_kernel_sizes : list + List of kernel sizes for upsampling layers. + resblock_kernel_sizes : list + List of kernel sizes for residual blocks. + resblock_dilations : list + List of dilation list for residual blocks. + use_additional_convs : bool + Whether to use additional conv layers in residual blocks. + bias : bool + Whether to add bias parameter in convolution layers. + nonlinear_activation : str + Activation function module name. + nonlinear_activation_params : dict + Hyperparameters for activation function. + use_weight_norm : bool + Whether to use weight norm. + If set to true, it will be applied to all of the conv layers. + """ + super().__init__() + + # initialize parameters + initialize(self, init_type) + + # check hyperparameters are valid + assert kernel_size % 2 == 1, "Kernel size must be odd number." 
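+        # NOTE: the product of upsample_scales is the number of waveform samples
+        # generated per input frame; with the defaults (8, 8, 2, 2) one mel frame
+        # becomes 8 * 8 * 2 * 2 = 256 samples, which is assumed to match the hop
+        # size used when the mel features were extracted.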
+ assert len(upsample_scales) == len(upsample_kernel_sizes) + assert len(resblock_dilations) == len(resblock_kernel_sizes) + + # define modules + self.num_upsamples = len(upsample_kernel_sizes) + self.num_blocks = len(resblock_kernel_sizes) + self.input_conv = nn.Conv1D( + in_channels, + channels, + kernel_size, + 1, + padding=(kernel_size - 1) // 2, ) + self.upsamples = nn.LayerList() + self.blocks = nn.LayerList() + for i in range(len(upsample_kernel_sizes)): + assert upsample_kernel_sizes[i] == 2 * upsample_scales[i] + self.upsamples.append( + nn.Sequential( + get_activation(nonlinear_activation, ** + nonlinear_activation_params), + nn.Conv1DTranspose( + channels // (2**i), + channels // (2**(i + 1)), + upsample_kernel_sizes[i], + upsample_scales[i], + padding=upsample_scales[i] // 2 + upsample_scales[i] % + 2, + output_padding=upsample_scales[i] % 2, ), )) + for j in range(len(resblock_kernel_sizes)): + self.blocks.append( + ResidualBlock( + kernel_size=resblock_kernel_sizes[j], + channels=channels // (2**(i + 1)), + dilations=resblock_dilations[j], + bias=bias, + use_additional_convs=use_additional_convs, + nonlinear_activation=nonlinear_activation, + nonlinear_activation_params=nonlinear_activation_params, + )) + self.output_conv = nn.Sequential( + nn.LeakyReLU(), + nn.Conv1D( + channels // (2**(i + 1)), + out_channels, + kernel_size, + 1, + padding=(kernel_size - 1) // 2, ), + nn.Tanh(), ) + + nn.initializer.set_global_initializer(None) + + # apply weight norm + if use_weight_norm: + self.apply_weight_norm() + + # reset parameters + self.reset_parameters() + + def forward(self, c): + """Calculate forward propagation. + Parameters + ---------- + c : Tensor + Input tensor (B, in_channels, T). + Returns + ---------- + Tensor + Output tensor (B, out_channels, T). + """ + c = self.input_conv(c) + for i in range(self.num_upsamples): + c = self.upsamples[i](c) + # initialize + cs = 0.0 + for j in range(self.num_blocks): + cs += self.blocks[i * self.num_blocks + j](c) + c = cs / self.num_blocks + c = self.output_conv(c) + + return c + + def reset_parameters(self): + """Reset parameters. + This initialization follows official implementation manner. + https://github.com/jik876/hifi-gan/blob/master/models.py + """ + # 定义参数为float的正态分布。 + dist = paddle.distribution.Normal(loc=0.0, scale=0.01) + + def _reset_parameters(m): + if isinstance(m, nn.Conv1D) or isinstance(m, nn.Conv1DTranspose): + w = dist.sample(m.weight.shape) + m.weight.set_value(w) + + self.apply(_reset_parameters) + + def apply_weight_norm(self): + """Recursively apply weight normalization to all the Convolution layers + in the sublayers. + """ + + def _apply_weight_norm(layer): + if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)): + nn.utils.weight_norm(layer) + + self.apply(_apply_weight_norm) + + def remove_weight_norm(self): + """Recursively remove weight normalization from all the Convolution + layers in the sublayers. + """ + + def _remove_weight_norm(layer): + try: + nn.utils.remove_weight_norm(layer) + except ValueError: + pass + + self.apply(_remove_weight_norm) + + def inference(self, c): + """Perform inference. + Parameters + ---------- + c : Tensor + Input tensor (T, in_channels). + normalize_before (bool): Whether to perform normalization. + Returns + ---------- + Tensor + Output tensor (T ** prod(upsample_scales), out_channels). 
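+            Here ``T ** prod(upsample_scales)`` is shorthand for the input
+            length multiplied by the product of the upsampling scales, not
+            Python exponentiation.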
+ """ + c = self.forward(c.transpose([1, 0]).unsqueeze(0)) + return c.squeeze(0).transpose([1, 0]) + + +class HiFiGANPeriodDiscriminator(nn.Layer): + """HiFiGAN period discriminator module.""" + + def __init__( + self, + in_channels: int=1, + out_channels: int=1, + period: int=3, + kernel_sizes: List[int]=[5, 3], + channels: int=32, + downsample_scales: List[int]=[3, 3, 3, 3, 1], + max_downsample_channels: int=1024, + bias: bool=True, + nonlinear_activation: str="leakyrelu", + nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1}, + use_weight_norm: bool=True, + use_spectral_norm: bool=False, + init_type: str="xavier_uniform", ): + """Initialize HiFiGANPeriodDiscriminator module. + Parameters + ---------- + in_channels : int + Number of input channels. + out_channels : int + Number of output channels. + period : int + Period. + kernel_sizes : list + Kernel sizes of initial conv layers and the final conv layer. + channels : int + Number of initial channels. + downsample_scales : list + List of downsampling scales. + max_downsample_channels : int + Number of maximum downsampling channels. + use_additional_convs : bool + Whether to use additional conv layers in residual blocks. + bias : bool + Whether to add bias parameter in convolution layers. + nonlinear_activation : str + Activation function module name. + nonlinear_activation_params : dict + Hyperparameters for activation function. + use_weight_norm : bool + Whether to use weight norm. + If set to true, it will be applied to all of the conv layers. + use_spectral_norm : bool + Whether to use spectral norm. + If set to true, it will be applied to all of the conv layers. + """ + super().__init__() + + # initialize parameters + initialize(self, init_type) + + assert len(kernel_sizes) == 2 + assert kernel_sizes[0] % 2 == 1, "Kernel size must be odd number." + assert kernel_sizes[1] % 2 == 1, "Kernel size must be odd number." + + self.period = period + self.convs = nn.LayerList() + in_chs = in_channels + out_chs = channels + for downsample_scale in downsample_scales: + self.convs.append( + nn.Sequential( + nn.Conv2D( + in_chs, + out_chs, + (kernel_sizes[0], 1), + (downsample_scale, 1), + padding=((kernel_sizes[0] - 1) // 2, 0), ), + get_activation(nonlinear_activation, ** + nonlinear_activation_params), )) + in_chs = out_chs + # NOTE: Use downsample_scale + 1? + out_chs = min(out_chs * 4, max_downsample_channels) + self.output_conv = nn.Conv2D( + out_chs, + out_channels, + (kernel_sizes[1] - 1, 1), + 1, + padding=((kernel_sizes[1] - 1) // 2, 0), ) + + if use_weight_norm and use_spectral_norm: + raise ValueError("Either use use_weight_norm or use_spectral_norm.") + + # apply weight norm + if use_weight_norm: + self.apply_weight_norm() + + # apply spectral norm + if use_spectral_norm: + self.apply_spectral_norm() + + def forward(self, x): + """Calculate forward propagation. + Parameters + ---------- + c : Tensor + Input tensor (B, in_channels, T). + Returns + ---------- + list + List of each layer's tensors. 
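+            One entry is appended per conv block, plus the flattened output of
+            the final conv layer, so the list has ``len(downsample_scales) + 1``
+            entries.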
+ """ + # transform 1d to 2d -> (B, C, T/P, P) + b, c, t = paddle.shape(x) + if t % self.period != 0: + n_pad = self.period - (t % self.period) + x = F.pad(x, (0, n_pad), "reflect", data_format="NCL") + t += n_pad + x = x.reshape([b, c, t // self.period, self.period]) + + # forward conv + outs = [] + for layer in self.convs: + x = layer(x) + outs += [x] + x = self.output_conv(x) + x = paddle.flatten(x, 1, -1) + outs += [x] + + return outs + + def apply_weight_norm(self): + """Recursively apply weight normalization to all the Convolution layers + in the sublayers. + """ + + def _apply_weight_norm(layer): + if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)): + nn.utils.weight_norm(layer) + + self.apply(_apply_weight_norm) + + def apply_spectral_norm(self): + """Apply spectral normalization module from all of the layers.""" + + def _apply_spectral_norm(m): + if isinstance(m, nn.Conv2D): + nn.utils.spectral_norm(m) + + self.apply(_apply_spectral_norm) + + +class HiFiGANMultiPeriodDiscriminator(nn.Layer): + """HiFiGAN multi-period discriminator module.""" + + def __init__( + self, + periods: List[int]=[2, 3, 5, 7, 11], + discriminator_params: Dict[str, Any]={ + "in_channels": 1, + "out_channels": 1, + "kernel_sizes": [5, 3], + "channels": 32, + "downsample_scales": [3, 3, 3, 3, 1], + "max_downsample_channels": 1024, + "bias": True, + "nonlinear_activation": "leakyrelu", + "nonlinear_activation_params": { + "negative_slope": 0.1 + }, + "use_weight_norm": True, + "use_spectral_norm": False, + }, + init_type: str="xavier_uniform", ): + """Initialize HiFiGANMultiPeriodDiscriminator module. + Parameters + ---------- + periods : list + List of periods. + discriminator_params : dict + Parameters for hifi-gan period discriminator module. + The period parameter will be overwritten. + """ + super().__init__() + # initialize parameters + initialize(self, init_type) + + self.discriminators = nn.LayerList() + for period in periods: + params = copy.deepcopy(discriminator_params) + params["period"] = period + self.discriminators.append(HiFiGANPeriodDiscriminator(**params)) + + def forward(self, x): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Input noise signal (B, 1, T). + Returns + ---------- + List + List of list of each discriminator outputs, which consists of each layer output tensors. + """ + outs = [] + for f in self.discriminators: + outs += [f(x)] + + return outs + + +class HiFiGANScaleDiscriminator(nn.Layer): + """HiFi-GAN scale discriminator module.""" + + def __init__( + self, + in_channels: int=1, + out_channels: int=1, + kernel_sizes: List[int]=[15, 41, 5, 3], + channels: int=128, + max_downsample_channels: int=1024, + max_groups: int=16, + bias: bool=True, + downsample_scales: List[int]=[2, 2, 4, 4, 1], + nonlinear_activation: str="leakyrelu", + nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1}, + use_weight_norm: bool=True, + use_spectral_norm: bool=False, + init_type: str="xavier_uniform", ): + """Initilize HiFiGAN scale discriminator module. + Parameters + ---------- + in_channels : int + Number of input channels. + out_channels : int + Number of output channels. + kernel_sizes : list + List of four kernel sizes. The first will be used for the first conv layer, + and the second is for downsampling part, and the remaining two are for output layers. + channels : int + Initial number of channels for conv layer. + max_downsample_channels : int + Maximum number of channels for downsampling layers. 
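+        max_groups : int
+            Maximum number of groups used by the grouped convolutions in the
+            downsampling layers.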
+ bias : bool + Whether to add bias parameter in convolution layers. + downsample_scales : list + List of downsampling scales. + nonlinear_activation : str + Activation function module name. + nonlinear_activation_params : dict + Hyperparameters for activation function. + use_weight_norm : bool + Whether to use weight norm. + If set to true, it will be applied to all of the conv layers. + use_spectral_norm : bool + Whether to use spectral norm. + If set to true, it will be applied to all of the conv layers. + """ + super().__init__() + + # initialize parameters + initialize(self, init_type) + + self.layers = nn.LayerList() + + # check kernel size is valid + assert len(kernel_sizes) == 4 + for ks in kernel_sizes: + assert ks % 2 == 1 + + # add first layer + self.layers.append( + nn.Sequential( + nn.Conv1D( + in_channels, + channels, + # NOTE: Use always the same kernel size + kernel_sizes[0], + bias_attr=bias, + padding=(kernel_sizes[0] - 1) // 2, ), + get_activation(nonlinear_activation, ** + nonlinear_activation_params), )) + + # add downsample layers + in_chs = channels + out_chs = channels + # NOTE(kan-bayashi): Remove hard coding? + groups = 4 + for downsample_scale in downsample_scales: + self.layers.append( + nn.Sequential( + nn.Conv1D( + in_chs, + out_chs, + kernel_size=kernel_sizes[1], + stride=downsample_scale, + padding=(kernel_sizes[1] - 1) // 2, + groups=groups, + bias_attr=bias, ), + get_activation(nonlinear_activation, ** + nonlinear_activation_params), )) + in_chs = out_chs + # NOTE: Remove hard coding? + out_chs = min(in_chs * 2, max_downsample_channels) + # NOTE: Remove hard coding? + groups = min(groups * 4, max_groups) + + # add final layers + out_chs = min(in_chs * 2, max_downsample_channels) + self.layers.append( + nn.Sequential( + nn.Conv1D( + in_chs, + out_chs, + kernel_size=kernel_sizes[2], + stride=1, + padding=(kernel_sizes[2] - 1) // 2, + bias_attr=bias, ), + get_activation(nonlinear_activation, ** + nonlinear_activation_params), )) + self.layers.append( + nn.Conv1D( + out_chs, + out_channels, + kernel_size=kernel_sizes[3], + stride=1, + padding=(kernel_sizes[3] - 1) // 2, + bias_attr=bias, ), ) + + if use_weight_norm and use_spectral_norm: + raise ValueError("Either use use_weight_norm or use_spectral_norm.") + + # apply weight norm + if use_weight_norm: + self.apply_weight_norm() + + # apply spectral norm + if use_spectral_norm: + self.apply_spectral_norm() + + def forward(self, x): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Input noise signal (B, 1, T). + Returns + ---------- + List + List of output tensors of each layer. + """ + outs = [] + for f in self.layers: + x = f(x) + outs += [x] + + return outs + + def apply_weight_norm(self): + """Recursively apply weight normalization to all the Convolution layers + in the sublayers. 
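+        Weight norm and spectral norm are mutually exclusive here; ``__init__``
+        raises a ``ValueError`` if both are requested.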
+ """ + + def _apply_weight_norm(layer): + if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)): + nn.utils.weight_norm(layer) + + self.apply(_apply_weight_norm) + + def apply_spectral_norm(self): + """Apply spectral normalization module from all of the layers.""" + + def _apply_spectral_norm(m): + if isinstance(m, nn.Conv2D): + nn.utils.spectral_norm(m) + + self.apply(_apply_spectral_norm) + + +class HiFiGANMultiScaleDiscriminator(nn.Layer): + """HiFi-GAN multi-scale discriminator module.""" + + def __init__( + self, + scales: int=3, + downsample_pooling: str="AvgPool1D", + # follow the official implementation setting + downsample_pooling_params: Dict[str, Any]={ + "kernel_size": 4, + "stride": 2, + "padding": 2, + }, + discriminator_params: Dict[str, Any]={ + "in_channels": 1, + "out_channels": 1, + "kernel_sizes": [15, 41, 5, 3], + "channels": 128, + "max_downsample_channels": 1024, + "max_groups": 16, + "bias": True, + "downsample_scales": [2, 2, 4, 4, 1], + "nonlinear_activation": "leakyrelu", + "nonlinear_activation_params": { + "negative_slope": 0.1 + }, + }, + follow_official_norm: bool=False, + init_type: str="xavier_uniform", ): + """Initilize HiFiGAN multi-scale discriminator module. + Parameters + ---------- + scales : int + Number of multi-scales. + downsample_pooling : str + Pooling module name for downsampling of the inputs. + downsample_pooling_params : dict + Parameters for the above pooling module. + discriminator_params : dict + Parameters for hifi-gan scale discriminator module. + follow_official_norm : bool + Whether to follow the norm setting of the official + implementaion. The first discriminator uses spectral norm and the other + discriminators use weight norm. + """ + super().__init__() + + # initialize parameters + initialize(self, init_type) + + self.discriminators = nn.LayerList() + + # add discriminators + for i in range(scales): + params = copy.deepcopy(discriminator_params) + if follow_official_norm: + if i == 0: + params["use_weight_norm"] = False + params["use_spectral_norm"] = True + else: + params["use_weight_norm"] = True + params["use_spectral_norm"] = False + self.discriminators.append(HiFiGANScaleDiscriminator(**params)) + self.pooling = getattr(nn, downsample_pooling)( + **downsample_pooling_params) + + def forward(self, x): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Input noise signal (B, 1, T). + Returns + ---------- + List + List of list of each discriminator outputs, which consists of each layer output tensors. 
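+            One sub-list is returned per scale (3 with the default settings);
+            the input is downsampled by the pooling layer between scales.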
+ """ + outs = [] + for f in self.discriminators: + outs += [f(x)] + x = self.pooling(x) + + return outs + + +class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer): + """HiFi-GAN multi-scale + multi-period discriminator module.""" + + def __init__( + self, + # Multi-scale discriminator related + scales: int=3, + scale_downsample_pooling: str="AvgPool1D", + scale_downsample_pooling_params: Dict[str, Any]={ + "kernel_size": 4, + "stride": 2, + "padding": 2, + }, + scale_discriminator_params: Dict[str, Any]={ + "in_channels": 1, + "out_channels": 1, + "kernel_sizes": [15, 41, 5, 3], + "channels": 128, + "max_downsample_channels": 1024, + "max_groups": 16, + "bias": True, + "downsample_scales": [2, 2, 4, 4, 1], + "nonlinear_activation": "leakyrelu", + "nonlinear_activation_params": { + "negative_slope": 0.1 + }, + }, + follow_official_norm: bool=True, + # Multi-period discriminator related + periods: List[int]=[2, 3, 5, 7, 11], + period_discriminator_params: Dict[str, Any]={ + "in_channels": 1, + "out_channels": 1, + "kernel_sizes": [5, 3], + "channels": 32, + "downsample_scales": [3, 3, 3, 3, 1], + "max_downsample_channels": 1024, + "bias": True, + "nonlinear_activation": "leakyrelu", + "nonlinear_activation_params": { + "negative_slope": 0.1 + }, + "use_weight_norm": True, + "use_spectral_norm": False, + }, + init_type: str="xavier_uniform", ): + """Initilize HiFiGAN multi-scale + multi-period discriminator module. + Parameters + ---------- + scales : int + Number of multi-scales. + scale_downsample_pooling : str + Pooling module name for downsampling of the inputs. + scale_downsample_pooling_params : dict + Parameters for the above pooling module. + scale_discriminator_params : dict + Parameters for hifi-gan scale discriminator module. + follow_official_norm : bool): Whether to follow the norm setting of the official + implementaion. The first discriminator uses spectral norm and the other + discriminators use weight norm. + periods : list + List of periods. + period_discriminator_params : dict + Parameters for hifi-gan period discriminator module. + The period parameter will be overwritten. + """ + super().__init__() + + # initialize parameters + initialize(self, init_type) + + self.msd = HiFiGANMultiScaleDiscriminator( + scales=scales, + downsample_pooling=scale_downsample_pooling, + downsample_pooling_params=scale_downsample_pooling_params, + discriminator_params=scale_discriminator_params, + follow_official_norm=follow_official_norm, ) + self.mpd = HiFiGANMultiPeriodDiscriminator( + periods=periods, + discriminator_params=period_discriminator_params, ) + + def forward(self, x): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Input noise signal (B, 1, T). + Returns + ---------- + List: + List of list of each discriminator outputs, + which consists of each layer output tensors. + Multi scale and multi period ones are concatenated. 
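+            With the defaults (3 scales and 5 periods) the returned list has
+            8 entries, scale discriminator outputs first.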
+ """ + msd_outs = self.msd(x) + mpd_outs = self.mpd(x) + return msd_outs + mpd_outs + + +class HiFiGANInference(nn.Layer): + def __init__(self, normalizer, hifigan_generator): + super().__init__() + self.normalizer = normalizer + self.hifigan_generator = hifigan_generator + + def forward(self, logmel): + normalized_mel = self.normalizer(logmel) + wav = self.hifigan_generator.inference(normalized_mel) + return wav diff --git a/paddlespeech/t2s/models/hifigan/hifigan_updater.py b/paddlespeech/t2s/models/hifigan/hifigan_updater.py new file mode 100644 index 0000000000000000000000000000000000000000..f12c666fd3a3ab08fa404466ccac39affcf8f43e --- /dev/null +++ b/paddlespeech/t2s/models/hifigan/hifigan_updater.py @@ -0,0 +1,247 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import logging +from typing import Dict + +import paddle +from paddle import distributed as dist +from paddle.io import DataLoader +from paddle.nn import Layer +from paddle.optimizer import Optimizer +from paddle.optimizer.lr import LRScheduler + +from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator +from paddlespeech.t2s.training.reporter import report +from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater +from paddlespeech.t2s.training.updaters.standard_updater import UpdaterState +logging.basicConfig( + format='%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] %(message)s', + datefmt='[%Y-%m-%d %H:%M:%S]') +logger = logging.getLogger(__name__) +logger.setLevel(logging.INFO) + + +class HiFiGANUpdater(StandardUpdater): + def __init__(self, + models: Dict[str, Layer], + optimizers: Dict[str, Optimizer], + criterions: Dict[str, Layer], + schedulers: Dict[str, LRScheduler], + dataloader: DataLoader, + generator_train_start_steps: int=0, + discriminator_train_start_steps: int=100000, + lambda_adv: float=1.0, + lambda_aux: float=1.0, + lambda_feat_match: float=1.0, + output_dir=None): + self.models = models + self.generator: Layer = models['generator'] + self.discriminator: Layer = models['discriminator'] + + self.optimizers = optimizers + self.optimizer_g: Optimizer = optimizers['generator'] + self.optimizer_d: Optimizer = optimizers['discriminator'] + + self.criterions = criterions + self.criterion_feat_match = criterions['feat_match'] + self.criterion_mel = criterions['mel'] + + self.criterion_gen_adv = criterions["gen_adv"] + self.criterion_dis_adv = criterions["dis_adv"] + + self.schedulers = schedulers + self.scheduler_g = schedulers['generator'] + self.scheduler_d = schedulers['discriminator'] + + self.dataloader = dataloader + + self.generator_train_start_steps = generator_train_start_steps + self.discriminator_train_start_steps = discriminator_train_start_steps + self.lambda_adv = lambda_adv + self.lambda_aux = lambda_aux + self.lambda_feat_match = lambda_feat_match + + self.state = UpdaterState(iteration=0, epoch=0) + self.train_iterator = iter(self.dataloader) + + log_file = output_dir / 
'worker_{}.log'.format(dist.get_rank()) + self.filehandler = logging.FileHandler(str(log_file)) + logger.addHandler(self.filehandler) + self.logger = logger + self.msg = "" + + def update_core(self, batch): + self.msg = "Rank: {}, ".format(dist.get_rank()) + losses_dict = {} + # parse batch + wav, mel = batch + + # Generator + if self.state.iteration > self.generator_train_start_steps: + # (B, out_channels, T ** prod(upsample_scales) + wav_ = self.generator(mel) + + # initialize + gen_loss = 0.0 + aux_loss = 0.0 + + # mel spectrogram loss + mel_loss = self.criterion_mel(wav_, wav) + aux_loss += mel_loss + report("train/mel_loss", float(mel_loss)) + losses_dict["mel_loss"] = float(mel_loss) + + gen_loss += aux_loss * self.lambda_aux + + # adversarial loss + if self.state.iteration > self.discriminator_train_start_steps: + p_ = self.discriminator(wav_) + adv_loss = self.criterion_gen_adv(p_) + report("train/adversarial_loss", float(adv_loss)) + losses_dict["adversarial_loss"] = float(adv_loss) + + # feature matching loss + # no need to track gradients + with paddle.no_grad(): + p = self.discriminator(wav) + fm_loss = self.criterion_feat_match(p_, p) + report("train/feature_matching_loss", float(fm_loss)) + losses_dict["feature_matching_loss"] = float(fm_loss) + + adv_loss += self.lambda_feat_match * fm_loss + + gen_loss += self.lambda_adv * adv_loss + + report("train/generator_loss", float(gen_loss)) + losses_dict["generator_loss"] = float(gen_loss) + + self.optimizer_g.clear_grad() + gen_loss.backward() + + self.optimizer_g.step() + self.scheduler_g.step() + + # Disctiminator + if self.state.iteration > self.discriminator_train_start_steps: + # re-compute wav_ which leads better quality + with paddle.no_grad(): + wav_ = self.generator(mel) + + p = self.discriminator(wav) + p_ = self.discriminator(wav_.detach()) + real_loss, fake_loss = self.criterion_dis_adv(p_, p) + dis_loss = real_loss + fake_loss + report("train/real_loss", float(real_loss)) + report("train/fake_loss", float(fake_loss)) + report("train/discriminator_loss", float(dis_loss)) + losses_dict["real_loss"] = float(real_loss) + losses_dict["fake_loss"] = float(fake_loss) + losses_dict["discriminator_loss"] = float(dis_loss) + + self.optimizer_d.clear_grad() + dis_loss.backward() + + self.optimizer_d.step() + self.scheduler_d.step() + + self.msg += ', '.join('{}: {:>.6f}'.format(k, v) + for k, v in losses_dict.items()) + + +class HiFiGANEvaluator(StandardEvaluator): + def __init__(self, + models: Dict[str, Layer], + criterions: Dict[str, Layer], + dataloader: DataLoader, + lambda_adv: float=1.0, + lambda_aux: float=1.0, + lambda_feat_match: float=1.0, + output_dir=None): + self.models = models + self.generator = models['generator'] + self.discriminator = models['discriminator'] + + self.criterions = criterions + self.criterion_feat_match = criterions['feat_match'] + self.criterion_mel = criterions['mel'] + self.criterion_gen_adv = criterions["gen_adv"] + self.criterion_dis_adv = criterions["dis_adv"] + + self.dataloader = dataloader + + self.lambda_adv = lambda_adv + self.lambda_aux = lambda_aux + self.lambda_feat_match = lambda_feat_match + + log_file = output_dir / 'worker_{}.log'.format(dist.get_rank()) + self.filehandler = logging.FileHandler(str(log_file)) + logger.addHandler(self.filehandler) + self.logger = logger + self.msg = "" + + def evaluate_core(self, batch): + # logging.debug("Evaluate: ") + self.msg = "Evaluate: " + losses_dict = {} + wav, mel = batch + + # Generator + # (B, out_channels, T ** prod(upsample_scales) 
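+        # NOTE: unlike the training updater, evaluation always computes both the
+        # generator and discriminator losses, without the *_train_start_steps
+        # gating used in update_core.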
+ wav_ = self.generator(mel) + + # initialize + gen_loss = 0.0 + aux_loss = 0.0 + + ## Adversarial loss + p_ = self.discriminator(wav_) + adv_loss = self.criterion_gen_adv(p_) + report("eval/adversarial_loss", float(adv_loss)) + losses_dict["adversarial_loss"] = float(adv_loss) + + # feature matching loss + p = self.discriminator(wav) + fm_loss = self.criterion_feat_match(p_, p) + report("eval/feature_matching_loss", float(fm_loss)) + losses_dict["feature_matching_loss"] = float(fm_loss) + adv_loss += self.lambda_feat_match * fm_loss + + gen_loss += self.lambda_adv * adv_loss + + # mel spectrogram loss + mel_loss = self.criterion_mel(wav_, wav) + aux_loss += mel_loss + report("eval/mel_loss", float(mel_loss)) + losses_dict["mel_loss"] = float(mel_loss) + + gen_loss += aux_loss * self.lambda_aux + + report("eval/generator_loss", float(gen_loss)) + losses_dict["generator_loss"] = float(gen_loss) + + # Disctiminator + p = self.discriminator(wav) + real_loss, fake_loss = self.criterion_dis_adv(p_, p) + dis_loss = real_loss + fake_loss + report("eval/real_loss", float(real_loss)) + report("eval/fake_loss", float(fake_loss)) + report("eval/discriminator_loss", float(dis_loss)) + + losses_dict["real_loss"] = float(real_loss) + losses_dict["fake_loss"] = float(fake_loss) + losses_dict["discriminator_loss"] = float(dis_loss) + + self.msg += ', '.join('{}: {:>.6f}'.format(k, v) + for k, v in losses_dict.items()) + self.logger.info(self.msg) diff --git a/paddlespeech/t2s/models/melgan/melgan.py b/paddlespeech/t2s/models/melgan/melgan.py index 32fcf65882b51808d69e1422bc4f46df07ba115a..3e90b69154729ca6de6b0fe8cec54d3b0692ca49 100644 --- a/paddlespeech/t2s/models/melgan/melgan.py +++ b/paddlespeech/t2s/models/melgan/melgan.py @@ -93,7 +93,8 @@ class MelGANGenerator(nn.Layer): initialize(self, init_type) # for compatibility - nonlinear_activation = nonlinear_activation.lower() + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() # check hyper parameters is valid assert channels >= np.prod(upsample_scales) @@ -101,6 +102,7 @@ class MelGANGenerator(nn.Layer): if not use_causal_conv: assert (kernel_size - 1 ) % 2 == 0, "Not support even number kernel size." 
+ layers = [] if not use_causal_conv: layers += [ @@ -327,7 +329,8 @@ class MelGANDiscriminator(nn.Layer): super().__init__() # for compatibility - nonlinear_activation = nonlinear_activation.lower() + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() # initialize parameters initialize(self, init_type) @@ -476,8 +479,9 @@ class MelGANMultiScaleDiscriminator(nn.Layer): # initialize parameters initialize(self, init_type) - # for compatibility - nonlinear_activation = nonlinear_activation.lower() + # for + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() self.discriminators = nn.LayerList() diff --git a/paddlespeech/t2s/models/melgan/multi_band_melgan_updater.py b/paddlespeech/t2s/models/melgan/multi_band_melgan_updater.py index 75e99627a711e055b22ab24600ba7a016a577748..1c6c34c2a7c0c24c34278c287e18bb5d4c0cf139 100644 --- a/paddlespeech/t2s/models/melgan/multi_band_melgan_updater.py +++ b/paddlespeech/t2s/models/melgan/multi_band_melgan_updater.py @@ -40,8 +40,10 @@ class MBMelGANUpdater(StandardUpdater): criterions: Dict[str, Layer], schedulers: Dict[str, LRScheduler], dataloader: DataLoader, - discriminator_train_start_steps: int, - lambda_adv: float, + generator_train_start_steps: int=0, + discriminator_train_start_steps: int=100000, + lambda_aux: float=1.0, + lambda_adv: float=1.0, output_dir: Path=None): self.models = models self.generator: Layer = models['generator'] @@ -64,10 +66,12 @@ class MBMelGANUpdater(StandardUpdater): self.dataloader = dataloader + self.generator_train_start_steps = generator_train_start_steps self.discriminator_train_start_steps = discriminator_train_start_steps self.lambda_adv = lambda_adv - self.state = UpdaterState(iteration=0, epoch=0) + self.lambda_aux = lambda_aux + self.state = UpdaterState(iteration=0, epoch=0) self.train_iterator = iter(self.dataloader) log_file = output_dir / 'worker_{}.log'.format(dist.get_rank()) @@ -79,57 +83,61 @@ class MBMelGANUpdater(StandardUpdater): def update_core(self, batch): self.msg = "Rank: {}, ".format(dist.get_rank()) losses_dict = {} - # parse batch wav, mel = batch - # Generator - # (B, out_channels, T ** prod(upsample_scales) - wav_ = self.generator(mel) - wav_mb_ = wav_ - # (B, 1, out_channels*T ** prod(upsample_scales) - wav_ = self.criterion_pqmf.synthesis(wav_mb_) - # initialize - gen_loss = 0.0 - - # full band Multi-resolution stft loss - sc_loss, mag_loss = self.criterion_stft(wav_, wav) - # for balancing with subband stft loss - # Eq.(9) in paper - gen_loss += 0.5 * (sc_loss + mag_loss) - report("train/spectral_convergence_loss", float(sc_loss)) - report("train/log_stft_magnitude_loss", float(mag_loss)) - losses_dict["spectral_convergence_loss"] = float(sc_loss) - losses_dict["log_stft_magnitude_loss"] = float(mag_loss) - - # sub band Multi-resolution stft loss - # (B, subbands, T // subbands) - wav_mb = self.criterion_pqmf.analysis(wav) - sub_sc_loss, sub_mag_loss = self.criterion_sub_stft(wav_mb_, wav_mb) - # Eq.(9) in paper - gen_loss += 0.5 * (sub_sc_loss + sub_mag_loss) - report("train/sub_spectral_convergence_loss", float(sub_sc_loss)) - report("train/sub_log_stft_magnitude_loss", float(sub_mag_loss)) - losses_dict["sub_spectral_convergence_loss"] = float(sub_sc_loss) - losses_dict["sub_log_stft_magnitude_loss"] = float(sub_mag_loss) - - ## Adversarial loss - if self.state.iteration > self.discriminator_train_start_steps: - p_ = self.discriminator(wav_) - adv_loss = self.criterion_gen_adv(p_) - - report("train/adversarial_loss", float(adv_loss)) 
- losses_dict["adversarial_loss"] = float(adv_loss) - gen_loss += self.lambda_adv * adv_loss - - report("train/generator_loss", float(gen_loss)) - losses_dict["generator_loss"] = float(gen_loss) - - self.optimizer_g.clear_grad() - gen_loss.backward() - - self.optimizer_g.step() - self.scheduler_g.step() + # Generator + if self.state.iteration > self.generator_train_start_steps: + # (B, out_channels, T ** prod(upsample_scales) + wav_ = self.generator(mel) + wav_mb_ = wav_ + # (B, 1, out_channels*T ** prod(upsample_scales) + wav_ = self.criterion_pqmf.synthesis(wav_mb_) + + # initialize + gen_loss = 0.0 + aux_loss = 0.0 + + # full band Multi-resolution stft loss + sc_loss, mag_loss = self.criterion_stft(wav_, wav) + # for balancing with subband stft loss + # Eq.(9) in paper + aux_loss += 0.5 * (sc_loss + mag_loss) + report("train/spectral_convergence_loss", float(sc_loss)) + report("train/log_stft_magnitude_loss", float(mag_loss)) + losses_dict["spectral_convergence_loss"] = float(sc_loss) + losses_dict["log_stft_magnitude_loss"] = float(mag_loss) + + # sub band Multi-resolution stft loss + # (B, subbands, T // subbands) + wav_mb = self.criterion_pqmf.analysis(wav) + sub_sc_loss, sub_mag_loss = self.criterion_sub_stft(wav_mb_, wav_mb) + # Eq.(9) in paper + aux_loss += 0.5 * (sub_sc_loss + sub_mag_loss) + report("train/sub_spectral_convergence_loss", float(sub_sc_loss)) + report("train/sub_log_stft_magnitude_loss", float(sub_mag_loss)) + losses_dict["sub_spectral_convergence_loss"] = float(sub_sc_loss) + losses_dict["sub_log_stft_magnitude_loss"] = float(sub_mag_loss) + + gen_loss += aux_loss * self.lambda_aux + + # adversarial loss + if self.state.iteration > self.discriminator_train_start_steps: + p_ = self.discriminator(wav_) + adv_loss = self.criterion_gen_adv(p_) + report("train/adversarial_loss", float(adv_loss)) + losses_dict["adversarial_loss"] = float(adv_loss) + + gen_loss += self.lambda_adv * adv_loss + + report("train/generator_loss", float(gen_loss)) + losses_dict["generator_loss"] = float(gen_loss) + + self.optimizer_g.clear_grad() + gen_loss.backward() + + self.optimizer_g.step() + self.scheduler_g.step() # Disctiminator if self.state.iteration > self.discriminator_train_start_steps: @@ -163,7 +171,8 @@ class MBMelGANEvaluator(StandardEvaluator): models: Dict[str, Layer], criterions: Dict[str, Layer], dataloader: DataLoader, - lambda_adv: float, + lambda_aux: float=1.0, + lambda_adv: float=1.0, output_dir: Path=None): self.models = models self.generator = models['generator'] @@ -177,7 +186,9 @@ class MBMelGANEvaluator(StandardEvaluator): self.criterion_dis_adv = criterions["dis_adv"] self.dataloader = dataloader + self.lambda_adv = lambda_adv + self.lambda_aux = lambda_aux log_file = output_dir / 'worker_{}.log'.format(dist.get_rank()) self.filehandler = logging.FileHandler(str(log_file)) @@ -189,8 +200,8 @@ class MBMelGANEvaluator(StandardEvaluator): # logging.debug("Evaluate: ") self.msg = "Evaluate: " losses_dict = {} - wav, mel = batch + # Generator # (B, out_channels, T ** prod(upsample_scales) wav_ = self.generator(mel) @@ -198,18 +209,22 @@ class MBMelGANEvaluator(StandardEvaluator): # (B, 1, out_channels*T ** prod(upsample_scales) wav_ = self.criterion_pqmf.synthesis(wav_mb_) - ## Adversarial loss + # initialize + gen_loss = 0.0 + aux_loss = 0.0 + + # adversarial loss p_ = self.discriminator(wav_) adv_loss = self.criterion_gen_adv(p_) - report("eval/adversarial_loss", float(adv_loss)) losses_dict["adversarial_loss"] = float(adv_loss) - gen_loss = self.lambda_adv * 
adv_loss + + gen_loss += self.lambda_adv * adv_loss # Multi-resolution stft loss sc_loss, mag_loss = self.criterion_stft(wav_, wav) # Eq.(9) in paper - gen_loss += 0.5 * (sc_loss + mag_loss) + aux_loss += 0.5 * (sc_loss + mag_loss) report("eval/spectral_convergence_loss", float(sc_loss)) report("eval/log_stft_magnitude_loss", float(mag_loss)) losses_dict["spectral_convergence_loss"] = float(sc_loss) @@ -220,12 +235,14 @@ class MBMelGANEvaluator(StandardEvaluator): wav_mb = self.criterion_pqmf.analysis(wav) sub_sc_loss, sub_mag_loss = self.criterion_sub_stft(wav_mb_, wav_mb) # Eq.(9) in paper - gen_loss += 0.5 * (sub_sc_loss + sub_mag_loss) + aux_loss += 0.5 * (sub_sc_loss + sub_mag_loss) report("eval/sub_spectral_convergence_loss", float(sub_sc_loss)) report("eval/sub_log_stft_magnitude_loss", float(sub_mag_loss)) losses_dict["sub_spectral_convergence_loss"] = float(sub_sc_loss) losses_dict["sub_log_stft_magnitude_loss"] = float(sub_mag_loss) + gen_loss += aux_loss * self.lambda_aux + report("eval/generator_loss", float(gen_loss)) losses_dict["generator_loss"] = float(gen_loss) diff --git a/paddlespeech/t2s/models/melgan/style_melgan.py b/paddlespeech/t2s/models/melgan/style_melgan.py index 4725a8d02dcc1401b8e0b609199072d2481cf9e5..0854c0a98125e6fd239c4a088734015c9f5537bd 100644 --- a/paddlespeech/t2s/models/melgan/style_melgan.py +++ b/paddlespeech/t2s/models/melgan/style_melgan.py @@ -14,7 +14,6 @@ # Modified from espnet(https://github.com/espnet/espnet) """StyleMelGAN Modules.""" import copy -import math from typing import Any from typing import Dict from typing import List @@ -225,14 +224,17 @@ class StyleMelGANGenerator(nn.Layer): c_shape = paddle.shape(c) # prepare noise input # there is a bug in Paddle int division, we must convert a int tensor to int here - noise_size = (1, self.in_channels, - math.ceil(int(c_shape[2]) / self.noise_upsample_factor)) + noise_T = paddle.cast( + paddle.ceil(c_shape[2] / int(self.noise_upsample_factor)), + dtype='int64') + noise_size = (1, self.in_channels, noise_T) # (1, in_channels, T/noise_upsample_factor) noise = paddle.randn(noise_size) # (1, in_channels, T) x = self.noise_upsample(noise) x_shape = paddle.shape(x) total_length = c_shape[2] * self.upsample_factor + # Dygraph to Static Graph bug here, 2021.12.15 c = F.pad( c, (0, x_shape[2] - c_shape[2]), "replicate", data_format="NCL") # c.shape[2] == x.shape[2] here @@ -243,7 +245,6 @@ class StyleMelGANGenerator(nn.Layer): return x.squeeze(0).transpose([1, 0]) -# StyleMelGANDiscriminator 不需要 remove weight norm 嘛? class StyleMelGANDiscriminator(nn.Layer): """Style MelGAN disciminator module.""" diff --git a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py index 9b0ba47498f996ce04c4d8e20838b5f39a2563cc..9eff4497342337ce6bb4e88a15d800e5bf28dc2d 100644 --- a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py +++ b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py @@ -21,299 +21,11 @@ from typing import Optional import numpy as np import paddle from paddle import nn -from paddle.nn import functional as F - -class Stretch2D(nn.Layer): - def __init__(self, w_scale: int, h_scale: int, mode: str="nearest"): - """Strech an image (or image-like object) with some interpolation. - - Parameters - ---------- - w_scale : int - Scalar of width. - h_scale : int - Scalar of the height. 
- mode : str, optional - Interpolation mode, modes suppored are "nearest", "bilinear", - "trilinear", "bicubic", "linear" and "area",by default "nearest" - - For more details about interpolation, see - `paddle.nn.functional.interpolate `_. - """ - super().__init__() - self.w_scale = w_scale - self.h_scale = h_scale - self.mode = mode - - def forward(self, x): - """ - Parameters - ---------- - x : Tensor - Shape (N, C, H, W) - - Returns - ------- - Tensor - Shape (N, C, H', W'), where ``H'=h_scale * H``, ``W'=w_scale * W``. - The stretched image. - """ - out = F.interpolate( - x, scale_factor=(self.h_scale, self.w_scale), mode=self.mode) - return out - - -class UpsampleNet(nn.Layer): - """A Layer to upsample spectrogram by applying consecutive stretch and - convolutions. - - Parameters - ---------- - upsample_scales : List[int] - Upsampling factors for each strech. - nonlinear_activation : Optional[str], optional - Activation after each convolution, by default None - nonlinear_activation_params : Dict[str, Any], optional - Parameters passed to construct the activation, by default {} - interpolate_mode : str, optional - Interpolation mode of the strech, by default "nearest" - freq_axis_kernel_size : int, optional - Convolution kernel size along the frequency axis, by default 1 - use_causal_conv : bool, optional - Whether to use causal padding before convolution, by default False - - If True, Causal padding is used along the time axis, i.e. padding - amount is ``receptive field - 1`` and 0 for before and after, - respectively. - - If False, "same" padding is used along the time axis. - """ - - def __init__(self, - upsample_scales: List[int], - nonlinear_activation: Optional[str]=None, - nonlinear_activation_params: Dict[str, Any]={}, - interpolate_mode: str="nearest", - freq_axis_kernel_size: int=1, - use_causal_conv: bool=False): - super().__init__() - self.use_causal_conv = use_causal_conv - self.up_layers = nn.LayerList() - for scale in upsample_scales: - stretch = Stretch2D(scale, 1, interpolate_mode) - assert freq_axis_kernel_size % 2 == 1 - freq_axis_padding = (freq_axis_kernel_size - 1) // 2 - kernel_size = (freq_axis_kernel_size, scale * 2 + 1) - if use_causal_conv: - padding = (freq_axis_padding, scale * 2) - else: - padding = (freq_axis_padding, scale) - conv = nn.Conv2D( - 1, 1, kernel_size, padding=padding, bias_attr=False) - self.up_layers.extend([stretch, conv]) - if nonlinear_activation is not None: - nonlinear = getattr( - nn, nonlinear_activation)(**nonlinear_activation_params) - self.up_layers.append(nonlinear) - - def forward(self, c): - """ - Parameters - ---------- - c : Tensor - Shape (N, F, T), spectrogram - - Returns - ------- - Tensor - Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled - spectrogram - """ - c = c.unsqueeze(1) - for f in self.up_layers: - if self.use_causal_conv and isinstance(f, nn.Conv2D): - c = f(c)[:, :, :, c.shape[-1]] - else: - c = f(c) - return c.squeeze(1) - - -class ConvInUpsampleNet(nn.Layer): - """A Layer to upsample spectrogram composed of a convolution and an - UpsampleNet. - - Parameters - ---------- - upsample_scales : List[int] - Upsampling factors for each strech. 
- nonlinear_activation : Optional[str], optional - Activation after each convolution, by default None - nonlinear_activation_params : Dict[str, Any], optional - Parameters passed to construct the activation, by default {} - interpolate_mode : str, optional - Interpolation mode of the strech, by default "nearest" - freq_axis_kernel_size : int, optional - Convolution kernel size along the frequency axis, by default 1 - aux_channels : int, optional - Feature size of the input, by default 80 - aux_context_window : int, optional - Context window of the first 1D convolution applied to the input. It - related to the kernel size of the convolution, by default 0 - - If use causal convolution, the kernel size is ``window + 1``, else - the kernel size is ``2 * window + 1``. - use_causal_conv : bool, optional - Whether to use causal padding before convolution, by default False - - If True, Causal padding is used along the time axis, i.e. padding - amount is ``receptive field - 1`` and 0 for before and after, - respectively. - - If False, "same" padding is used along the time axis. - """ - - def __init__(self, - upsample_scales: List[int], - nonlinear_activation: Optional[str]=None, - nonlinear_activation_params: Dict[str, Any]={}, - interpolate_mode: str="nearest", - freq_axis_kernel_size: int=1, - aux_channels: int=80, - aux_context_window: int=0, - use_causal_conv: bool=False): - super().__init__() - self.aux_context_window = aux_context_window - self.use_causal_conv = use_causal_conv and aux_context_window > 0 - kernel_size = aux_context_window + 1 if use_causal_conv else 2 * aux_context_window + 1 - self.conv_in = nn.Conv1D( - aux_channels, - aux_channels, - kernel_size=kernel_size, - bias_attr=False) - self.upsample = UpsampleNet( - upsample_scales=upsample_scales, - nonlinear_activation=nonlinear_activation, - nonlinear_activation_params=nonlinear_activation_params, - interpolate_mode=interpolate_mode, - freq_axis_kernel_size=freq_axis_kernel_size, - use_causal_conv=use_causal_conv) - - def forward(self, c): - """ - Parameters - ---------- - c : Tensor - Shape (N, F, T), spectrogram - - Returns - ------- - Tensors - Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled - spectrogram - """ - c_ = self.conv_in(c) - c = c_[:, :, :-self.aux_context_window] if self.use_causal_conv else c_ - return self.upsample(c) - - -class ResidualBlock(nn.Layer): - """A gated activation unit composed of an 1D convolution, a gated tanh - unit and parametric redidual and skip connections. For more details, - refer to `WaveNet: A Generative Model for Raw Audio `_. - - Parameters - ---------- - kernel_size : int, optional - Kernel size of the 1D convolution, by default 3 - residual_channels : int, optional - Feature size of the resiaudl output(and also the input), by default 64 - gate_channels : int, optional - Output feature size of the 1D convolution, by default 128 - skip_channels : int, optional - Feature size of the skip output, by default 64 - aux_channels : int, optional - Feature size of the auxiliary input (e.g. spectrogram), by default 80 - dropout : float, optional - Probability of the dropout before the 1D convolution, by default 0. 
- dilation : int, optional - Dilation of the 1D convolution, by default 1 - bias : bool, optional - Whether to use bias in the 1D convolution, by default True - use_causal_conv : bool, optional - Whether to use causal padding for the 1D convolution, by default False - """ - - def __init__(self, - kernel_size: int=3, - residual_channels: int=64, - gate_channels: int=128, - skip_channels: int=64, - aux_channels: int=80, - dropout: float=0., - dilation: int=1, - bias: bool=True, - use_causal_conv: bool=False): - super().__init__() - self.dropout = dropout - if use_causal_conv: - padding = (kernel_size - 1) * dilation - else: - assert kernel_size % 2 == 1 - padding = (kernel_size - 1) // 2 * dilation - self.use_causal_conv = use_causal_conv - - self.conv = nn.Conv1D( - residual_channels, - gate_channels, - kernel_size, - padding=padding, - dilation=dilation, - bias_attr=bias) - if aux_channels is not None: - self.conv1x1_aux = nn.Conv1D( - aux_channels, gate_channels, kernel_size=1, bias_attr=False) - else: - self.conv1x1_aux = None - - gate_out_channels = gate_channels // 2 - self.conv1x1_out = nn.Conv1D( - gate_out_channels, residual_channels, kernel_size=1, bias_attr=bias) - self.conv1x1_skip = nn.Conv1D( - gate_out_channels, skip_channels, kernel_size=1, bias_attr=bias) - - def forward(self, x, c): - """ - Parameters - ---------- - x : Tensor - Shape (N, C_res, T), the input features. - c : Tensor - Shape (N, C_aux, T), the auxiliary input. - - Returns - ------- - res : Tensor - Shape (N, C_res, T), the residual output, which is used as the - input of the next ResidualBlock in a stack of ResidualBlocks. - skip : Tensor - Shape (N, C_skip, T), the skip output, which is collected among - each layer in a stack of ResidualBlocks. - """ - x_input = x - x = F.dropout(x, self.dropout, training=self.training) - x = self.conv(x) - x = x[:, :, x_input.shape[-1]] if self.use_causal_conv else x - if c is not None: - c = self.conv1x1_aux(c) - x += c - - a, b = paddle.chunk(x, 2, axis=1) - x = paddle.tanh(a) * F.sigmoid(b) - - skip = self.conv1x1_skip(x) - res = (self.conv1x1_out(x) + x_input) * math.sqrt(0.5) - return res, skip +from paddlespeech.t2s.modules.activation import get_activation +from paddlespeech.t2s.modules.nets_utils import initialize +from paddlespeech.t2s.modules.residual_block import WaveNetResidualBlock as ResidualBlock +from paddlespeech.t2s.modules.upsample import ConvInUpsampleNet class PWGGenerator(nn.Layer): @@ -331,7 +43,6 @@ class PWGGenerator(nn.Layer): Number of residual blocks inside, by default 30 stacks : int, optional The number of groups to split the residual blocks into, by default 3 - Within each group, the dilation of the residual block grows exponentially. 
residual_channels : int, optional @@ -367,27 +78,37 @@ class PWGGenerator(nn.Layer): Kernel size along the frequency axis of the upsample network, by default 1 """ - def __init__(self, - in_channels: int=1, - out_channels: int=1, - kernel_size: int=3, - layers: int=30, - stacks: int=3, - residual_channels: int=64, - gate_channels: int=128, - skip_channels: int=64, - aux_channels: int=80, - aux_context_window: int=2, - dropout: float=0., - bias: bool=True, - use_weight_norm: bool=True, - use_causal_conv: bool=False, - upsample_scales: List[int]=[4, 4, 4, 4], - nonlinear_activation: Optional[str]=None, - nonlinear_activation_params: Dict[str, Any]={}, - interpolate_mode: str="nearest", - freq_axis_kernel_size: int=1): + def __init__( + self, + in_channels: int=1, + out_channels: int=1, + kernel_size: int=3, + layers: int=30, + stacks: int=3, + residual_channels: int=64, + gate_channels: int=128, + skip_channels: int=64, + aux_channels: int=80, + aux_context_window: int=2, + dropout: float=0., + bias: bool=True, + use_weight_norm: bool=True, + use_causal_conv: bool=False, + upsample_scales: List[int]=[4, 4, 4, 4], + nonlinear_activation: Optional[str]=None, + nonlinear_activation_params: Dict[str, Any]={}, + interpolate_mode: str="nearest", + freq_axis_kernel_size: int=1, + init_type: str="xavier_uniform", ): super().__init__() + + # initialize parameters + initialize(self, init_type) + + # for compatibility + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() + self.in_channels = in_channels self.out_channels = out_channels self.aux_channels = aux_channels @@ -540,7 +261,7 @@ class PWGDiscriminator(nn.Layer): exponentially if it is greater than 1, else the dilation of each convolutional sublayers grows linearly, by default 1 nonlinear_activation : str, optional - The activation after each convolutional sublayer, by default "LeakyReLU" + The activation after each convolutional sublayer, by default "leakyrelu" nonlinear_activation_params : Dict[str, Any], optional The parameters passed to the activation's initializer, by default {"negative_slope": 0.2} @@ -559,11 +280,19 @@ class PWGDiscriminator(nn.Layer): layers: int=10, conv_channels: int=64, dilation_factor: int=1, - nonlinear_activation: str="LeakyReLU", + nonlinear_activation: str="leakyrelu", nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.2}, bias: bool=True, - use_weight_norm: bool=True): + use_weight_norm: bool=True, + init_type: str="xavier_uniform", ): super().__init__() + + # initialize parameters + initialize(self, init_type) + # for compatibility + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() + assert kernel_size % 2 == 1 assert dilation_factor > 0 conv_layers = [] @@ -582,8 +311,8 @@ class PWGDiscriminator(nn.Layer): padding=padding, dilation=dilation, bias_attr=bias) - nonlinear = getattr( - nn, nonlinear_activation)(**nonlinear_activation_params) + nonlinear = get_activation(nonlinear_activation, + **nonlinear_activation_params) conv_layers.append(conv_layer) conv_layers.append(nonlinear) padding = (kernel_size - 1) // 2 @@ -663,28 +392,37 @@ class ResidualPWGDiscriminator(nn.Layer): Whether to use causal convolution in residual blocks, by default False nonlinear_activation : str, optional Activation after convolutions other than those in residual blocks, - by default "LeakyReLU" + by default "leakyrelu" nonlinear_activation_params : Dict[str, Any], optional Parameters to pass to the activation, by default {"negative_slope": 0.2} """ - def 
__init__(self, - in_channels: int=1, - out_channels: int=1, - kernel_size: int=3, - layers: int=30, - stacks: int=3, - residual_channels: int=64, - gate_channels: int=128, - skip_channels: int=64, - dropout: float=0., - bias: bool=True, - use_weight_norm: bool=True, - use_causal_conv: bool=False, - nonlinear_activation: str="LeakyReLU", - nonlinear_activation_params: Dict[ - str, Any]={"negative_slope": 0.2}): + def __init__( + self, + in_channels: int=1, + out_channels: int=1, + kernel_size: int=3, + layers: int=30, + stacks: int=3, + residual_channels: int=64, + gate_channels: int=128, + skip_channels: int=64, + dropout: float=0., + bias: bool=True, + use_weight_norm: bool=True, + use_causal_conv: bool=False, + nonlinear_activation: str="leakyrelu", + nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.2}, + init_type: str="xavier_uniform", ): super().__init__() + + # initialize parameters + initialize(self, init_type) + + # for compatibility + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() + assert kernel_size % 2 == 1 self.in_channels = in_channels self.out_channels = out_channels @@ -697,7 +435,7 @@ class ResidualPWGDiscriminator(nn.Layer): self.first_conv = nn.Sequential( nn.Conv1D(in_channels, residual_channels, 1, bias_attr=True), - getattr(nn, nonlinear_activation)(**nonlinear_activation_params)) + get_activation(nonlinear_activation, **nonlinear_activation_params)) self.conv_layers = nn.LayerList() for layer in range(layers): @@ -715,9 +453,9 @@ class ResidualPWGDiscriminator(nn.Layer): self.conv_layers.append(conv) self.last_conv_layers = nn.Sequential( - getattr(nn, nonlinear_activation)(**nonlinear_activation_params), + get_activation(nonlinear_activation, **nonlinear_activation_params), nn.Conv1D(skip_channels, skip_channels, 1, bias_attr=True), - getattr(nn, nonlinear_activation)(**nonlinear_activation_params), + get_activation(nonlinear_activation, **nonlinear_activation_params), nn.Conv1D(skip_channels, out_channels, 1, bias_attr=True)) if use_weight_norm: diff --git a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan_updater.py b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan_updater.py index 79707aa4ebcade609a49f1aa6d5eae2e535398bd..40cfff5a5eedf54754a2f7ef6388964a0949f6f2 100644 --- a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan_updater.py +++ b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan_updater.py @@ -21,7 +21,6 @@ from paddle.io import DataLoader from paddle.nn import Layer from paddle.optimizer import Optimizer from paddle.optimizer.lr import LRScheduler -from timer import timer from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator from paddlespeech.t2s.training.reporter import report @@ -42,8 +41,10 @@ class PWGUpdater(StandardUpdater): criterions: Dict[str, Layer], schedulers: Dict[str, LRScheduler], dataloader: DataLoader, - discriminator_train_start_steps: int, - lambda_adv: float, + generator_train_start_steps: int=0, + discriminator_train_start_steps: int=100000, + lambda_adv: float=1.0, + lambda_aux: float=1.0, output_dir: Path=None): self.models = models self.generator: Layer = models['generator'] @@ -63,8 +64,10 @@ class PWGUpdater(StandardUpdater): self.dataloader = dataloader + self.generator_train_start_steps = generator_train_start_steps self.discriminator_train_start_steps = discriminator_train_start_steps self.lambda_adv = lambda_adv + self.lambda_aux = lambda_aux self.state = UpdaterState(iteration=0, epoch=0) self.train_iterator = 
iter(self.dataloader) @@ -78,56 +81,46 @@ class PWGUpdater(StandardUpdater): def update_core(self, batch): self.msg = "Rank: {}, ".format(dist.get_rank()) losses_dict = {} - # parse batch wav, mel = batch # Generator - noise = paddle.randn(wav.shape) - - with timer() as t: + if self.state.iteration > self.generator_train_start_steps: + noise = paddle.randn(wav.shape) wav_ = self.generator(noise, mel) - # logging.debug(f"Generator takes {t.elapse}s.") - # initialize - gen_loss = 0.0 + # initialize + gen_loss = 0.0 + aux_loss = 0.0 - ## Multi-resolution stft loss - with timer() as t: + # multi-resolution stft loss sc_loss, mag_loss = self.criterion_stft(wav_, wav) - # logging.debug(f"Multi-resolution STFT loss takes {t.elapse}s.") + aux_loss += sc_loss + mag_loss + report("train/spectral_convergence_loss", float(sc_loss)) + report("train/log_stft_magnitude_loss", float(mag_loss)) - report("train/spectral_convergence_loss", float(sc_loss)) - report("train/log_stft_magnitude_loss", float(mag_loss)) + gen_loss += aux_loss * self.lambda_aux - losses_dict["spectral_convergence_loss"] = float(sc_loss) - losses_dict["log_stft_magnitude_loss"] = float(mag_loss) - - gen_loss += sc_loss + mag_loss + losses_dict["spectral_convergence_loss"] = float(sc_loss) + losses_dict["log_stft_magnitude_loss"] = float(mag_loss) - ## Adversarial loss - if self.state.iteration > self.discriminator_train_start_steps: - with timer() as t: + # adversarial loss + if self.state.iteration > self.discriminator_train_start_steps: p_ = self.discriminator(wav_) adv_loss = self.criterion_mse(p_, paddle.ones_like(p_)) - # logging.debug( - # f"Discriminator and adversarial loss takes {t.elapse}s") - report("train/adversarial_loss", float(adv_loss)) - losses_dict["adversarial_loss"] = float(adv_loss) - gen_loss += self.lambda_adv * adv_loss + report("train/adversarial_loss", float(adv_loss)) + losses_dict["adversarial_loss"] = float(adv_loss) - report("train/generator_loss", float(gen_loss)) - losses_dict["generator_loss"] = float(gen_loss) + gen_loss += self.lambda_adv * adv_loss + + report("train/generator_loss", float(gen_loss)) + losses_dict["generator_loss"] = float(gen_loss) - with timer() as t: self.optimizer_g.clear_grad() gen_loss.backward() - # logging.debug(f"Backward takes {t.elapse}s.") - with timer() as t: self.optimizer_g.step() self.scheduler_g.step() - # logging.debug(f"Update takes {t.elapse}s.") # Disctiminator if self.state.iteration > self.discriminator_train_start_steps: @@ -160,7 +153,8 @@ class PWGEvaluator(StandardEvaluator): models: Dict[str, Layer], criterions: Dict[str, Layer], dataloader: DataLoader, - lambda_adv: float, + lambda_adv: float=1.0, + lambda_aux: float=1.0, output_dir: Path=None): self.models = models self.generator = models['generator'] @@ -171,7 +165,9 @@ class PWGEvaluator(StandardEvaluator): self.criterion_mse = criterions['mse'] self.dataloader = dataloader + self.lambda_adv = lambda_adv + self.lambda_aux = lambda_aux log_file = output_dir / 'worker_{}.log'.format(dist.get_rank()) self.filehandler = logging.FileHandler(str(log_file)) @@ -183,34 +179,33 @@ class PWGEvaluator(StandardEvaluator): # logging.debug("Evaluate: ") self.msg = "Evaluate: " losses_dict = {} - wav, mel = batch noise = paddle.randn(wav.shape) - with timer() as t: - wav_ = self.generator(noise, mel) - # logging.debug(f"Generator takes {t.elapse}s") - - ## Adversarial loss - with timer() as t: - p_ = self.discriminator(wav_) - adv_loss = self.criterion_mse(p_, paddle.ones_like(p_)) - # logging.debug( - # 
f"Discriminator and adversarial loss takes {t.elapse}s") + # Generator + wav_ = self.generator(noise, mel) + + # initialize + gen_loss = 0.0 + aux_loss = 0.0 + + # adversarial loss + p_ = self.discriminator(wav_) + adv_loss = self.criterion_mse(p_, paddle.ones_like(p_)) report("eval/adversarial_loss", float(adv_loss)) losses_dict["adversarial_loss"] = float(adv_loss) - gen_loss = self.lambda_adv * adv_loss - # stft loss - with timer() as t: - sc_loss, mag_loss = self.criterion_stft(wav_, wav) - # logging.debug(f"Multi-resolution STFT loss takes {t.elapse}s") + gen_loss += self.lambda_adv * adv_loss + # multi-resolution stft loss + sc_loss, mag_loss = self.criterion_stft(wav_, wav) report("eval/spectral_convergence_loss", float(sc_loss)) report("eval/log_stft_magnitude_loss", float(mag_loss)) losses_dict["spectral_convergence_loss"] = float(sc_loss) losses_dict["log_stft_magnitude_loss"] = float(mag_loss) - gen_loss += sc_loss + mag_loss + aux_loss += sc_loss + mag_loss + + gen_loss += aux_loss * self.lambda_aux report("eval/generator_loss", float(gen_loss)) losses_dict["generator_loss"] = float(gen_loss) diff --git a/paddlespeech/t2s/modules/losses.py b/paddlespeech/t2s/modules/losses.py index 6b0ab6b3342f39e2f9dc8e17d4d3f69ff82460a2..569e96ada6f493ecba6d923ed8fa02cac7500fb8 100644 --- a/paddlespeech/t2s/modules/losses.py +++ b/paddlespeech/t2s/modules/losses.py @@ -13,6 +13,7 @@ # limitations under the License. import math +import librosa import paddle from paddle import nn from paddle.fluid.layers import sequence_mask @@ -457,3 +458,206 @@ def masked_l1_loss(prediction, target, mask): abs_error = F.l1_loss(prediction, target, reduction='none') loss = weighted_mean(abs_error, mask) return loss + + +class MelSpectrogram(nn.Layer): + """Calculate Mel-spectrogram.""" + + def __init__( + self, + fs=22050, + fft_size=1024, + hop_size=256, + win_length=None, + window="hann", + num_mels=80, + fmin=80, + fmax=7600, + center=True, + normalized=False, + onesided=True, + eps=1e-10, + log_base=10.0, ): + """Initialize MelSpectrogram module.""" + super().__init__() + self.fft_size = fft_size + if win_length is None: + self.win_length = fft_size + else: + self.win_length = win_length + self.hop_size = hop_size + self.center = center + self.normalized = normalized + self.onesided = onesided + + if window is not None and not hasattr(signal.windows, f"{window}"): + raise ValueError(f"{window} window is not implemented") + self.window = window + self.eps = eps + + fmin = 0 if fmin is None else fmin + fmax = fs / 2 if fmax is None else fmax + melmat = librosa.filters.mel( + sr=fs, + n_fft=fft_size, + n_mels=num_mels, + fmin=fmin, + fmax=fmax, ) + + self.melmat = paddle.to_tensor(melmat.T) + self.stft_params = { + "n_fft": self.fft_size, + "win_length": self.win_length, + "hop_length": self.hop_size, + "center": self.center, + "normalized": self.normalized, + "onesided": self.onesided, + } + + self.log_base = log_base + if self.log_base is None: + self.log = paddle.log + elif self.log_base == 2.0: + self.log = paddle.log2 + elif self.log_base == 10.0: + self.log = paddle.log10 + else: + raise ValueError(f"log_base: {log_base} is not supported.") + + def forward(self, x): + """Calculate Mel-spectrogram. + Parameters + ---------- + x : Tensor + Input waveform tensor (B, T) or (B, 1, T). + Returns + ---------- + Tensor + Mel-spectrogram (B, #mels, #frames). 
+ """ + if len(x.shape) == 3: + # (B, C, T) -> (B*C, T) + x = x.reshape([-1, paddle.shape(x)[2]]) + + if self.window is not None: + # calculate window + window = signal.get_window( + self.window, self.win_length, fftbins=True) + window = paddle.to_tensor(window) + else: + window = None + + x_stft = paddle.signal.stft(x, window=window, **self.stft_params) + real = x_stft.real() + imag = x_stft.imag() + # (B, #freqs, #frames) -> (B, $frames, #freqs) + real = real.transpose([0, 2, 1]) + imag = imag.transpose([0, 2, 1]) + x_power = real**2 + imag**2 + x_amp = paddle.sqrt(paddle.clip(x_power, min=self.eps)) + x_mel = paddle.matmul(x_amp, self.melmat) + x_mel = paddle.clip(x_mel, min=self.eps) + + return self.log(x_mel).transpose([0, 2, 1]) + + +class MelSpectrogramLoss(nn.Layer): + """Mel-spectrogram loss.""" + + def __init__( + self, + fs=22050, + fft_size=1024, + hop_size=256, + win_length=None, + window="hann", + num_mels=80, + fmin=80, + fmax=7600, + center=True, + normalized=False, + onesided=True, + eps=1e-10, + log_base=10.0, ): + """Initialize Mel-spectrogram loss.""" + super().__init__() + self.mel_spectrogram = MelSpectrogram( + fs=fs, + fft_size=fft_size, + hop_size=hop_size, + win_length=win_length, + window=window, + num_mels=num_mels, + fmin=fmin, + fmax=fmax, + center=center, + normalized=normalized, + onesided=onesided, + eps=eps, + log_base=log_base, ) + + def forward(self, y_hat, y): + """Calculate Mel-spectrogram loss. + Parameters + ---------- + y_hat : Tensor + Generated single tensor (B, 1, T). + y : Tensor + Groundtruth single tensor (B, 1, T). + Returns + ---------- + Tensor + Mel-spectrogram loss value. + """ + mel_hat = self.mel_spectrogram(y_hat) + mel = self.mel_spectrogram(y) + mel_loss = F.l1_loss(mel_hat, mel) + + return mel_loss + + +class FeatureMatchLoss(nn.Layer): + """Feature matching loss module.""" + + def __init__( + self, + average_by_layers=True, + average_by_discriminators=True, + include_final_outputs=False, ): + """Initialize FeatureMatchLoss module.""" + super().__init__() + self.average_by_layers = average_by_layers + self.average_by_discriminators = average_by_discriminators + self.include_final_outputs = include_final_outputs + + def forward(self, feats_hat, feats): + """Calcualate feature matching loss. + Parameters + ---------- + feats_hat : list + List of list of discriminator outputs + calcuated from generater outputs. + feats : list + List of list of discriminator outputs + calcuated from groundtruth. + Returns + ---------- + Tensor + Feature matching loss value. + + """ + feat_match_loss = 0.0 + for i, (feats_hat_, feats_) in enumerate(zip(feats_hat, feats)): + feat_match_loss_ = 0.0 + if not self.include_final_outputs: + feats_hat_ = feats_hat_[:-1] + feats_ = feats_[:-1] + for j, (feat_hat_, feat_) in enumerate(zip(feats_hat_, feats_)): + feat_match_loss_ += F.l1_loss(feat_hat_, feat_.detach()) + if self.average_by_layers: + feat_match_loss_ /= j + 1 + feat_match_loss += feat_match_loss_ + if self.average_by_discriminators: + feat_match_loss /= i + 1 + + return feat_match_loss diff --git a/paddlespeech/t2s/modules/residual_block.py b/paddlespeech/t2s/modules/residual_block.py new file mode 100644 index 0000000000000000000000000000000000000000..a96a8946362c283f6205509d7dafa887420a21a2 --- /dev/null +++ b/paddlespeech/t2s/modules/residual_block.py @@ -0,0 +1,207 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +from typing import Any +from typing import Dict +from typing import List + +import paddle +from paddle import nn +from paddle.nn import functional as F + +from paddlespeech.t2s.modules.activation import get_activation + + +class WaveNetResidualBlock(nn.Layer): + """A gated activation unit composed of an 1D convolution, a gated tanh + unit and parametric redidual and skip connections. For more details, + refer to `WaveNet: A Generative Model for Raw Audio `_. + + Parameters + ---------- + kernel_size : int, optional + Kernel size of the 1D convolution, by default 3 + residual_channels : int, optional + Feature size of the resiaudl output(and also the input), by default 64 + gate_channels : int, optional + Output feature size of the 1D convolution, by default 128 + skip_channels : int, optional + Feature size of the skip output, by default 64 + aux_channels : int, optional + Feature size of the auxiliary input (e.g. spectrogram), by default 80 + dropout : float, optional + Probability of the dropout before the 1D convolution, by default 0. + dilation : int, optional + Dilation of the 1D convolution, by default 1 + bias : bool, optional + Whether to use bias in the 1D convolution, by default True + use_causal_conv : bool, optional + Whether to use causal padding for the 1D convolution, by default False + """ + + def __init__(self, + kernel_size: int=3, + residual_channels: int=64, + gate_channels: int=128, + skip_channels: int=64, + aux_channels: int=80, + dropout: float=0., + dilation: int=1, + bias: bool=True, + use_causal_conv: bool=False): + super().__init__() + self.dropout = dropout + if use_causal_conv: + padding = (kernel_size - 1) * dilation + else: + assert kernel_size % 2 == 1 + padding = (kernel_size - 1) // 2 * dilation + self.use_causal_conv = use_causal_conv + + self.conv = nn.Conv1D( + residual_channels, + gate_channels, + kernel_size, + padding=padding, + dilation=dilation, + bias_attr=bias) + if aux_channels is not None: + self.conv1x1_aux = nn.Conv1D( + aux_channels, gate_channels, kernel_size=1, bias_attr=False) + else: + self.conv1x1_aux = None + + gate_out_channels = gate_channels // 2 + self.conv1x1_out = nn.Conv1D( + gate_out_channels, residual_channels, kernel_size=1, bias_attr=bias) + self.conv1x1_skip = nn.Conv1D( + gate_out_channels, skip_channels, kernel_size=1, bias_attr=bias) + + def forward(self, x, c): + """ + Parameters + ---------- + x : Tensor + Shape (N, C_res, T), the input features. + c : Tensor + Shape (N, C_aux, T), the auxiliary input. + + Returns + ------- + res : Tensor + Shape (N, C_res, T), the residual output, which is used as the + input of the next ResidualBlock in a stack of ResidualBlocks. + skip : Tensor + Shape (N, C_skip, T), the skip output, which is collected among + each layer in a stack of ResidualBlocks. 
+ """ + x_input = x + x = F.dropout(x, self.dropout, training=self.training) + x = self.conv(x) + x = x[:, :, x_input.shape[-1]] if self.use_causal_conv else x + if c is not None: + c = self.conv1x1_aux(c) + x += c + + a, b = paddle.chunk(x, 2, axis=1) + x = paddle.tanh(a) * F.sigmoid(b) + + skip = self.conv1x1_skip(x) + res = (self.conv1x1_out(x) + x_input) * math.sqrt(0.5) + return res, skip + + +class HiFiGANResidualBlock(nn.Layer): + """Residual block module in HiFiGAN.""" + + def __init__( + self, + kernel_size: int=3, + channels: int=512, + dilations: List[int]=(1, 3, 5), + bias: bool=True, + use_additional_convs: bool=True, + nonlinear_activation: str="leakyrelu", + nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1}, + ): + """Initialize HiFiGANResidualBlock module. + Parameters + ---------- + kernel_size : int + Kernel size of dilation convolution layer. + channels : int + Number of channels for convolution layer. + dilations : List[int] + List of dilation factors. + use_additional_convs : bool + Whether to use additional convolution layers. + bias : bool + Whether to add bias parameter in convolution layers. + nonlinear_activation : str + Activation function module name. + nonlinear_activation_params : dict + Hyperparameters for activation function. + """ + super().__init__() + + self.use_additional_convs = use_additional_convs + self.convs1 = nn.LayerList() + if use_additional_convs: + self.convs2 = nn.LayerList() + assert kernel_size % 2 == 1, "Kernel size must be odd number." + + for dilation in dilations: + self.convs1.append( + nn.Sequential( + get_activation(nonlinear_activation, ** + nonlinear_activation_params), + nn.Conv1D( + channels, + channels, + kernel_size, + 1, + dilation=dilation, + bias_attr=bias, + padding=(kernel_size - 1) // 2 * dilation, ), )) + if use_additional_convs: + self.convs2.append( + nn.Sequential( + get_activation(nonlinear_activation, ** + nonlinear_activation_params), + nn.Conv1D( + channels, + channels, + kernel_size, + 1, + dilation=1, + bias_attr=bias, + padding=(kernel_size - 1) // 2, ), )) + + def forward(self, x): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Input tensor (B, channels, T). + Returns + ---------- + Tensor + Output tensor (B, channels, T). + """ + for idx in range(len(self.convs1)): + xt = self.convs1[idx](x) + if self.use_additional_convs: + xt = self.convs2[idx](xt) + x = xt + x + return x diff --git a/paddlespeech/t2s/modules/residual_stack.py b/paddlespeech/t2s/modules/residual_stack.py index b4f95229c7e4e0d0510bf9265fc704d1d690fcb7..c885dfe9da7aeb0ad3d376428dd093e6418b86b9 100644 --- a/paddlespeech/t2s/modules/residual_stack.py +++ b/paddlespeech/t2s/modules/residual_stack.py @@ -60,7 +60,8 @@ class ResidualStack(nn.Layer): """ super().__init__() # for compatibility - nonlinear_activation = nonlinear_activation.lower() + if nonlinear_activation: + nonlinear_activation = nonlinear_activation.lower() # defile residual stack part if not use_causal_conv: diff --git a/paddlespeech/t2s/modules/upsample.py b/paddlespeech/t2s/modules/upsample.py new file mode 100644 index 0000000000000000000000000000000000000000..82e30414a13411b4703c44d30a83ad36e02cc283 --- /dev/null +++ b/paddlespeech/t2s/modules/upsample.py @@ -0,0 +1,220 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Modified from espnet(https://github.com/espnet/espnet) +from typing import Any +from typing import Dict +from typing import List +from typing import Optional + +from paddle import nn +from paddle.nn import functional as F + +from paddlespeech.t2s.modules.activation import get_activation + + +class Stretch2D(nn.Layer): + def __init__(self, w_scale: int, h_scale: int, mode: str="nearest"): + """Strech an image (or image-like object) with some interpolation. + + Parameters + ---------- + w_scale : int + Scalar of width. + h_scale : int + Scalar of the height. + mode : str, optional + Interpolation mode, modes suppored are "nearest", "bilinear", + "trilinear", "bicubic", "linear" and "area",by default "nearest" + + For more details about interpolation, see + `paddle.nn.functional.interpolate `_. + """ + super().__init__() + self.w_scale = w_scale + self.h_scale = h_scale + self.mode = mode + + def forward(self, x): + """ + Parameters + ---------- + x : Tensor + Shape (N, C, H, W) + + Returns + ------- + Tensor + Shape (N, C, H', W'), where ``H'=h_scale * H``, ``W'=w_scale * W``. + The stretched image. + """ + out = F.interpolate( + x, scale_factor=(self.h_scale, self.w_scale), mode=self.mode) + return out + + +class UpsampleNet(nn.Layer): + """A Layer to upsample spectrogram by applying consecutive stretch and + convolutions. + + Parameters + ---------- + upsample_scales : List[int] + Upsampling factors for each strech. + nonlinear_activation : Optional[str], optional + Activation after each convolution, by default None + nonlinear_activation_params : Dict[str, Any], optional + Parameters passed to construct the activation, by default {} + interpolate_mode : str, optional + Interpolation mode of the strech, by default "nearest" + freq_axis_kernel_size : int, optional + Convolution kernel size along the frequency axis, by default 1 + use_causal_conv : bool, optional + Whether to use causal padding before convolution, by default False + + If True, Causal padding is used along the time axis, i.e. padding + amount is ``receptive field - 1`` and 0 for before and after, + respectively. + + If False, "same" padding is used along the time axis. 
+ """ + + def __init__(self, + upsample_scales: List[int], + nonlinear_activation: Optional[str]=None, + nonlinear_activation_params: Dict[str, Any]={}, + interpolate_mode: str="nearest", + freq_axis_kernel_size: int=1, + use_causal_conv: bool=False): + super().__init__() + self.use_causal_conv = use_causal_conv + self.up_layers = nn.LayerList() + + for scale in upsample_scales: + stretch = Stretch2D(scale, 1, interpolate_mode) + assert freq_axis_kernel_size % 2 == 1 + freq_axis_padding = (freq_axis_kernel_size - 1) // 2 + kernel_size = (freq_axis_kernel_size, scale * 2 + 1) + if use_causal_conv: + padding = (freq_axis_padding, scale * 2) + else: + padding = (freq_axis_padding, scale) + conv = nn.Conv2D( + 1, 1, kernel_size, padding=padding, bias_attr=False) + self.up_layers.extend([stretch, conv]) + if nonlinear_activation is not None: + # for compatibility + nonlinear_activation = nonlinear_activation.lower() + + nonlinear = get_activation(nonlinear_activation, + **nonlinear_activation_params) + self.up_layers.append(nonlinear) + + def forward(self, c): + """ + Parameters + ---------- + c : Tensor + Shape (N, F, T), spectrogram + + Returns + ------- + Tensor + Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled + spectrogram + """ + c = c.unsqueeze(1) + for f in self.up_layers: + if self.use_causal_conv and isinstance(f, nn.Conv2D): + c = f(c)[:, :, :, c.shape[-1]] + else: + c = f(c) + return c.squeeze(1) + + +class ConvInUpsampleNet(nn.Layer): + """A Layer to upsample spectrogram composed of a convolution and an + UpsampleNet. + + Parameters + ---------- + upsample_scales : List[int] + Upsampling factors for each strech. + nonlinear_activation : Optional[str], optional + Activation after each convolution, by default None + nonlinear_activation_params : Dict[str, Any], optional + Parameters passed to construct the activation, by default {} + interpolate_mode : str, optional + Interpolation mode of the strech, by default "nearest" + freq_axis_kernel_size : int, optional + Convolution kernel size along the frequency axis, by default 1 + aux_channels : int, optional + Feature size of the input, by default 80 + aux_context_window : int, optional + Context window of the first 1D convolution applied to the input. It + related to the kernel size of the convolution, by default 0 + + If use causal convolution, the kernel size is ``window + 1``, else + the kernel size is ``2 * window + 1``. + use_causal_conv : bool, optional + Whether to use causal padding before convolution, by default False + + If True, Causal padding is used along the time axis, i.e. padding + amount is ``receptive field - 1`` and 0 for before and after, + respectively. + + If False, "same" padding is used along the time axis. 
+ """ + + def __init__(self, + upsample_scales: List[int], + nonlinear_activation: Optional[str]=None, + nonlinear_activation_params: Dict[str, Any]={}, + interpolate_mode: str="nearest", + freq_axis_kernel_size: int=1, + aux_channels: int=80, + aux_context_window: int=0, + use_causal_conv: bool=False): + super().__init__() + self.aux_context_window = aux_context_window + self.use_causal_conv = use_causal_conv and aux_context_window > 0 + kernel_size = aux_context_window + 1 if use_causal_conv else 2 * aux_context_window + 1 + self.conv_in = nn.Conv1D( + aux_channels, + aux_channels, + kernel_size=kernel_size, + bias_attr=False) + self.upsample = UpsampleNet( + upsample_scales=upsample_scales, + nonlinear_activation=nonlinear_activation, + nonlinear_activation_params=nonlinear_activation_params, + interpolate_mode=interpolate_mode, + freq_axis_kernel_size=freq_axis_kernel_size, + use_causal_conv=use_causal_conv) + + def forward(self, c): + """ + Parameters + ---------- + c : Tensor + Shape (N, F, T), spectrogram + + Returns + ------- + Tensors + Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled + spectrogram + """ + c_ = self.conv_in(c) + c = c_[:, :, :-self.aux_context_window] if self.use_causal_conv else c_ + return self.upsample(c)