@@ -159,6 +157,8 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
### Recent Update
- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web).
+ 小数据微调:基于小数据集的微调方案,内置用12句话标贝中文女声微调示例,你也可以通过一键重置,录制自己的声音,注意在安静环境下录制,效果会更好。你可以在 [【Finetune your own AM based on FastSpeech2 with AISHELL-3】](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tts_finetune/tts3)中尝试使用自己的数据集进行微调。
PaddleSpeech 是基于飞桨 PaddlePaddle 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发。支持语音识别,语音合成,声纹识别,声音分类,语音唤醒,语音翻译等多种语音任务,荣获 NAACL2022 Best Demo Award 。如果你喜欢这个示例,欢迎在 github 中 star 收藏鼓励。
@@ -42,7 +42,7 @@ Audio samples generated from ground-truth spectrograms with a vocoder.
<tr>
<td >Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition</td>
@@ -81,7 +81,7 @@ Audio samples generated from ground-truth spectrograms with a vocoder.
<tr>
<td>For although the Chinese took impressions from wood blocks engraved in relief for centuries before the woodcutters of the Netherlands, by a similar process</td>
@@ -119,7 +119,7 @@ Audio samples generated from ground-truth spectrograms with a vocoder.
<tr>
<td>the invention of movable metal letters in the middle of the fifteenth century may justly be considered as the invention of the art of printing.</td>
@@ -1736,3 +1736,141 @@ We use ``FastSpeech2`` + ``ParallelWaveGAN`` here.
<br>
Finetune FastSpeech2 for CSMSC
--------------------------------------
Finetuning demos of `tts_finetune/tts3 <https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tts_finetune/tts3>`_ for CSMSC dataset.
When finetuning for CSMSC, we thought ``Freeze encoder`` > ``Non Frozen`` > ``Freeze encoder && duration_predictor`` for audio quality.
.. raw:: html
<div class="table">
CSMSC reference audio (fastspeech2_csmsc + hifigan_aishlle3 in CLI): 欢迎使用飞桨语音套件。
ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
# Finetune your own AM based on FastSpeech2 with AISHELL-3.
This example shows how to finetune your own AM based on FastSpeech2 with AISHELL-3. We use part of csmsc's data (top 200) as finetune data in this example. The example is implemented according to this [discussion](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1842). Thanks to the developer for the idea.
# Finetune your own AM based on FastSpeech2 with multi-speakers dataset.
This example shows how to finetune your own AM based on FastSpeech2 with multi-speakers dataset. For finetuning Chinese data, we use part of csmsc's data (top 200) and Fastspeech2 pretrained model with AISHELL-3. For finetuning English data, we use part of ljspeech's data (top 200) and Fastspeech2 pretrained model with VCTK. The example is implemented according to this [discussion](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1842). Thanks to the developer for the idea.
For more information on training Fastspeech2 with AISHELL-3, You can refer [examples/aishell3/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3). For more information on training Fastspeech2 with VCTK, You can refer [examples/vctk/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3).
We use AISHELL-3 to train a multi-speaker fastspeech2 model. You can refer [examples/aishell3/tts3](https://github.com/lym0302/PaddleSpeech/tree/develop/examples/aishell3/tts3) to train multi-speaker fastspeech2 from scratch.
## Prepare
### Download Pretrained Fastspeech2 model
Assume the path to the model is `./pretrained_models`. Download pretrained fastspeech2 model with aishell3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip).
### Download Pretrained model
Assume the path to the model is `./pretrained_models`. </br>
If you want to finetune Chinese data, you need to download Fastspeech2 pretrained model with AISHELL-3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip) for finetuning. Download HiFiGAN pretrained model with aishell3: [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
If you want to finetune English data, you need to download Fastspeech2 pretrained model with VCTK: [fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip) for finetuning. Download HiFiGAN pretrained model with VCTK: [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) for synthesis.
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz) and pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip).
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz).
If you want to finetune Chinese data, you need to download pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip) and unzip it.
If you want to finetune English data, you need to download pretrained MFA models with vctk: [vctk_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip) and unzip it.
Assume the path to the dataset is `./input`. This directory contains audio files (*.wav) and label file (labels.txt). The audio file is in wav format. The format of the label file is: utt_id|pinyin. Here is an example of the first 200 data of csmsc.
Assume the path to the dataset is `./input` which contains a speaker folder. Speaker folder contains audio files (*.wav) and label file (labels.txt). The format of the audio file is wav. The format of the label file is: utt_id|pronunciation. </br>
If you want to finetune Chinese data, Chinese label example: 000001|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1</br>
Here is an example of the first 200 data of csmsc.
```bash
mkdir-p input &&cd input
...
...
@@ -60,7 +99,12 @@ When "Prepare" done. The structure of the current directory is listed below.
│ │ ├── snapshot_iter_96400.pdz
│ │ ├── speaker_id_map.txt
│ │ └── speech_stats.npy
│ └── fastspeech2_aishell3_ckpt_1.1.0.zip
│ ├── fastspeech2_aishell3_ckpt_1.1.0.zip
│ ├── hifigan_aishell3_ckpt_0.2.0
│ │ ├── default.yaml
│ │ ├── feats_stats.npy
│ │ └── snapshot_iter_2500000.pdz
│ └── hifigan_aishell3_ckpt_0.2.0.zip
└── tools
├── aligner
│ ├── aishell3_model
...
...
@@ -75,17 +119,68 @@ When "Prepare" done. The structure of the current directory is listed below.
```
If you want to finetune English data, English label example: LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition </br>
Here is an example of the first 200 data of ljspeech.
When "Prepare" done. The structure of the current directory is listed below.
```text
├── input
│ ├── ljspeech_mini
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ │ ├── LJ002-0014.wav
│ │ ├── labels.txt
│ └── ljspeech_mini.zip
├── pretrained_models
│ ├── fastspeech2_vctk_ckpt_1.2.0
│ │ ├── default.yaml
│ │ ├── energy_stats.npy
│ │ ├── phone_id_map.txt
│ │ ├── pitch_stats.npy
│ │ ├── snapshot_iter_66200.pdz
│ │ ├── speaker_id_map.txt
│ │ └── speech_stats.npy
│ ├── fastspeech2_vctk_ckpt_1.2.0.zip
│ ├── hifigan_vctk_ckpt_0.2.0
│ │ ├── default.yaml
│ │ ├── feats_stats.npy
│ │ └── snapshot_iter_2500000.pdz
│ └── hifigan_vctk_ckpt_0.2.0.zip
└── tools
├── aligner
│ ├── vctk_model
│ ├── vctk_model.zip
│ └── cmudict-0.7b
├── montreal-forced-aligner
│ ├── bin
│ ├── lib
│ └── pretrained_models
└── montreal-forced-aligner_linux.tar.gz
...
```
### Set finetune.yaml
`finetune.yaml` contains some configurations for fine-tuning. You can try various options to fine better result.
`conf/finetune.yaml` contains some configurations for fine-tuning. You can try various options to fine better result. The value of frozen_layers can be change according `conf/fastspeech2_layers.txt` which is the model layer of fastspeech2.
Arguments:
-`batch_size`: finetune batch size. Default: -1, means 64 which same to pretrained model
-`batch_size`: finetune batch size which should be less than or equal to the number of training samples. Default: -1, means 64 which same to pretrained model
-`learning_rate`: learning rate. Default: 0.0001
-`num_snapshots`: number of save models. Default: -1, means 5 which same to pretrained model
-`frozen_layers`: frozen layers. must be a list. If you don't want to frozen any layer, set [].
## Get Started
For Chinese data finetune, execute `./run.sh`. For English data finetune, execute `./run_en.sh`. </br>
Run the command below to
1.**source path**.
2. finetune the model.
...
...
@@ -102,76 +197,59 @@ You can choose a range of stages you want to run, or set `stage` equal to `stop-
Finetune a FastSpeech2 model.
```bash
./run.sh --stage 0 --stop-stage0
./run.sh --stage 0 --stop-stage5
```
`stage 0` of `run.sh` calls `finetune.py`, here's the complete help message.
`stage 5` of `run.sh` calls `local/finetune.py`, here's the complete help message.
the batch size of finetune, default -1 means same as pretrained model
Directory to save finetune model
--ngpu NGPU The number of gpu, if ngpu=0, use cpu
--epoch EPOCH The epoch of finetune
--finetune_config FINETUNE_CONFIG
Path to finetune config file
```
1.`--input_dir` is the directory containing audio and label file.
2.`--pretrained_model_dir` is the directory incluing pretrained fastspeech2_aishell3 model.
3.`--mfa_dir` is the directory to save the results of aligning from pretrained MFA_aishell3 model.
4.`--dump_dir` is the directory including audio feature and metadata.
5.`--output_dir` is the directory to save finetune model.
6.`--lang` is the language of input audio, zh or en.
7.`--ngpu` is the number of gpu.
8.`--epoch` is the epoch of finetune.
9.`--batch_size` is the batch size of finetune.
1.`--pretrained_model_dir` is the directory incluing pretrained fastspeech2_aishell3 model.
2.`--dump_dir` is the directory including audio feature and metadata.
3.`--output_dir` is the directory to save finetune model.
4.`--ngpu` is the number of gpu, if ngpu=0, use cpu
5.`--epoch` is the epoch of finetune.
6.`--finetune_config` is the path to finetune config file
### Synthesizing
We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
To synthesize Chinese audio, We use [HiFiGAN with aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
Assume the path to the hifigan model is `./pretrained_models`. Download the pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.
To synthesize English audio, We use [HiFiGAN with vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5) as the neural vocoder.
Assume the path to the hifigan model is `./pretrained_models`. Download the pretrained HiFiGAN model from [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) and unzip it.
HiFiGAN checkpoint contains files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml # default config used to train HiFiGAN
├── feats_stats.npy # statistics used to normalize spectrogram when training HiFiGAN
└── snapshot_iter_2500000.pdz # generator parameters of HiFiGAN
```
Modify `ckpt` in `run.sh` to the final model in `exp/default/checkpoints`.
```bash
./run.sh --stage1 --stop-stage 1
./run.sh --stage6 --stop-stage 6
```
`stage 1` of `run.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
`stage 6` of `run.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
1.`--am` is acoustic model type with the format {model_name}_{dataset}
2.`--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict``--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3.`--voc` is vocoder type with the format {model_name}_{dataset}
...
...
@@ -219,5 +298,8 @@ optional arguments:
7.`--output_dir` is the directory to save synthesized audio files.
8.`--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
### Tips
If you want to get better audio quality, you can use more audios to finetune.
If you want to get better audio quality, you can use more audios to finetune or change configuration parameters in `conf/finetune.yaml`.</br>
More finetune results can be found on [finetune-fastspeech2-for-csmsc](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html#finetune-fastspeech2-for-csmsc).</br>
The results show the effect on csmsc_mini: Freeze encoder > Non Frozen > Freeze encoder && duration_predictor.