# Finetune your own AM based on FastSpeech2 with AISHELL-3.
This example shows how to finetune your own AM based on FastSpeech2 with AISHELL-3. We use part of csmsc's data (top 200) as finetune data in this example. The example is implemented according to this [discussion](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1842). Thanks to the developer for the idea.
# Finetune your own AM based on FastSpeech2 with multi-speakers dataset.
This example shows how to finetune your own AM based on FastSpeech2 with multi-speakers dataset. For finetuning Chinese data, we use part of csmsc's data (top 200) and Fastspeech2 pretrained model with AISHELL-3. For finetuning English data, we use part of ljspeech's data (top 200) and Fastspeech2 pretrained model with VCTK. The example is implemented according to this [discussion](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1842). Thanks to the developer for the idea.
For more information on training Fastspeech2 with AISHELL-3, You can refer [examples/aishell3/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3). For more information on training Fastspeech2 with VCTK, You can refer [examples/vctk/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3).
We use AISHELL-3 to train a multi-speaker fastspeech2 model. You can refer [examples/aishell3/tts3](https://github.com/lym0302/PaddleSpeech/tree/develop/examples/aishell3/tts3) to train multi-speaker fastspeech2 from scratch.
## Prepare
### Download Pretrained Fastspeech2 model
Assume the path to the model is `./pretrained_models`. Download pretrained fastspeech2 model with aishell3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip).
### Download Pretrained model
Assume the path to the model is `./pretrained_models`. </br>
If you want to finetune Chinese data, you need to download Fastspeech2 pretrained model with AISHELL-3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip) for finetuning. Download HiFiGAN pretrained model with aishell3: [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
If you want to finetune English data, you need to download Fastspeech2 pretrained model with VCTK: [fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip) for finetuning. Download HiFiGAN pretrained model with VCTK: [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) for synthesis.
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz) and pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip).
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz).
If you want to finetune Chinese data, you need to download pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip) and unzip it.
If you want to finetune English data, you need to download pretrained MFA models with vctk: [vctk_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip) and unzip it.
Assume the path to the dataset is `./input`. This directory contains audio files (*.wav) and label file (labels.txt). The audio file is in wav format. The format of the label file is: utt_id|pinyin. Here is an example of the first 200 data of csmsc.
Assume the path to the dataset is `./input` which contains a speaker folder. Speaker folder contains audio files (*.wav) and label file (labels.txt). The format of the audio file is wav. The format of the label file is: utt_id|pronunciation. </br>
If you want to finetune Chinese data, Chinese label example: 000001|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1</br>
Here is an example of the first 200 data of csmsc.
```bash
mkdir-p input &&cd input
...
...
@@ -60,7 +99,12 @@ When "Prepare" done. The structure of the current directory is listed below.
│ │ ├── snapshot_iter_96400.pdz
│ │ ├── speaker_id_map.txt
│ │ └── speech_stats.npy
│ └── fastspeech2_aishell3_ckpt_1.1.0.zip
│ ├── fastspeech2_aishell3_ckpt_1.1.0.zip
│ ├── hifigan_aishell3_ckpt_0.2.0
│ │ ├── default.yaml
│ │ ├── feats_stats.npy
│ │ └── snapshot_iter_2500000.pdz
│ └── hifigan_aishell3_ckpt_0.2.0.zip
└── tools
├── aligner
│ ├── aishell3_model
...
...
@@ -75,17 +119,68 @@ When "Prepare" done. The structure of the current directory is listed below.
```
If you want to finetune English data, English label example: LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition </br>
Here is an example of the first 200 data of ljspeech.
When "Prepare" done. The structure of the current directory is listed below.
```text
├── input
│ ├── ljspeech_mini
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ │ ├── LJ002-0014.wav
│ │ ├── labels.txt
│ └── ljspeech_mini.zip
├── pretrained_models
│ ├── fastspeech2_vctk_ckpt_1.2.0
│ │ ├── default.yaml
│ │ ├── energy_stats.npy
│ │ ├── phone_id_map.txt
│ │ ├── pitch_stats.npy
│ │ ├── snapshot_iter_66200.pdz
│ │ ├── speaker_id_map.txt
│ │ └── speech_stats.npy
│ ├── fastspeech2_vctk_ckpt_1.2.0.zip
│ ├── hifigan_vctk_ckpt_0.2.0
│ │ ├── default.yaml
│ │ ├── feats_stats.npy
│ │ └── snapshot_iter_2500000.pdz
│ └── hifigan_vctk_ckpt_0.2.0.zip
└── tools
├── aligner
│ ├── vctk_model
│ ├── vctk_model.zip
│ └── cmudict-0.7b
├── montreal-forced-aligner
│ ├── bin
│ ├── lib
│ └── pretrained_models
└── montreal-forced-aligner_linux.tar.gz
...
```
### Set finetune.yaml
`finetune.yaml` contains some configurations for fine-tuning. You can try various options to fine better result.
`conf/finetune.yaml` contains some configurations for fine-tuning. You can try various options to fine better result. The value of frozen_layers can be change according `conf/fastspeech2_layers.txt` which is the model layer of fastspeech2.
Arguments:
-`batch_size`: finetune batch size. Default: -1, means 64 which same to pretrained model
-`batch_size`: finetune batch size which should be less than or equal to the number of training samples. Default: -1, means 64 which same to pretrained model
-`learning_rate`: learning rate. Default: 0.0001
-`num_snapshots`: number of save models. Default: -1, means 5 which same to pretrained model
-`frozen_layers`: frozen layers. must be a list. If you don't want to frozen any layer, set [].
## Get Started
For Chinese data finetune, execute `./run.sh`. For English data finetune, execute `./run_en.sh`. </br>
Run the command below to
1.**source path**.
2. finetune the model.
...
...
@@ -102,76 +197,59 @@ You can choose a range of stages you want to run, or set `stage` equal to `stop-
Finetune a FastSpeech2 model.
```bash
./run.sh --stage 0 --stop-stage0
./run.sh --stage 0 --stop-stage5
```
`stage 0` of `run.sh` calls `finetune.py`, here's the complete help message.
`stage 5` of `run.sh` calls `local/finetune.py`, here's the complete help message.
the batch size of finetune, default -1 means same as pretrained model
Directory to save finetune model
--ngpu NGPU The number of gpu, if ngpu=0, use cpu
--epoch EPOCH The epoch of finetune
--finetune_config FINETUNE_CONFIG
Path to finetune config file
```
1.`--input_dir` is the directory containing audio and label file.
2.`--pretrained_model_dir` is the directory incluing pretrained fastspeech2_aishell3 model.
3.`--mfa_dir` is the directory to save the results of aligning from pretrained MFA_aishell3 model.
4.`--dump_dir` is the directory including audio feature and metadata.
5.`--output_dir` is the directory to save finetune model.
6.`--lang` is the language of input audio, zh or en.
7.`--ngpu` is the number of gpu.
8.`--epoch` is the epoch of finetune.
9.`--batch_size` is the batch size of finetune.
1.`--pretrained_model_dir` is the directory incluing pretrained fastspeech2_aishell3 model.
2.`--dump_dir` is the directory including audio feature and metadata.
3.`--output_dir` is the directory to save finetune model.
4.`--ngpu` is the number of gpu, if ngpu=0, use cpu
5.`--epoch` is the epoch of finetune.
6.`--finetune_config` is the path to finetune config file
### Synthesizing
We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
To synthesize Chinese audio, We use [HiFiGAN with aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
Assume the path to the hifigan model is `./pretrained_models`. Download the pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.
To synthesize English audio, We use [HiFiGAN with vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5) as the neural vocoder.
Assume the path to the hifigan model is `./pretrained_models`. Download the pretrained HiFiGAN model from [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) and unzip it.
HiFiGAN checkpoint contains files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml # default config used to train HiFiGAN
├── feats_stats.npy # statistics used to normalize spectrogram when training HiFiGAN
└── snapshot_iter_2500000.pdz # generator parameters of HiFiGAN
```
Modify `ckpt` in `run.sh` to the final model in `exp/default/checkpoints`.
```bash
./run.sh --stage1 --stop-stage 1
./run.sh --stage6 --stop-stage 6
```
`stage 1` of `run.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
`stage 6` of `run.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
1.`--am` is acoustic model type with the format {model_name}_{dataset}
2.`--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict``--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3.`--voc` is vocoder type with the format {model_name}_{dataset}
...
...
@@ -219,7 +298,8 @@ optional arguments:
7.`--output_dir` is the directory to save synthesized audio files.
8.`--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Tips
If you want to get better audio quality, you can use more audios to finetune.
More finetune results can be found on [finetune-fastspeech2-for-csmsc](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html#finetune-fastspeech2-for-csmsc).
If you want to get better audio quality, you can use more audios to finetune or change configuration parameters in `conf/finetune.yaml`.</br>
More finetune results can be found on [finetune-fastspeech2-for-csmsc](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html#finetune-fastspeech2-for-csmsc).</br>
The results show the effect on csmsc_mini: Freeze encoder > Non Frozen > Freeze encoder && duration_predictor.