diff --git a/examples/aishell/asr1/README.md b/examples/aishell/asr1/README.md index 8c53f95f67514ad8ba5f9050b4d5e1aa3651ecbc..da753634a4f046d60758ccc49ca95c00d46b18b1 100644 --- a/examples/aishell/asr1/README.md +++ b/examples/aishell/asr1/README.md @@ -28,4 +28,4 @@ Need set `decoding.decoding_chunk_size=16` when decoding. | transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention | 3.858648955821991 | 0.057293 | | transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_greedy_search | 3.858648955821991 | 0.061837 | | transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_prefix_beam_search | 3.858648955821991 | 0.061685 | -| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention_rescoring | 3.858648955821991 | 0.053844 | \ No newline at end of file +| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention_rescoring | 3.858648955821991 | 0.053844 | diff --git a/examples/aishell3/tts3/README.md b/examples/aishell3/tts3/README.md index 056f35ba96dd1342fafdf9bfba7e840662c738cf..eb2cca2e2ef538b4a4d0a0ef363b163ef056766b 100644 --- a/examples/aishell3/tts3/README.md +++ b/examples/aishell3/tts3/README.md @@ -5,7 +5,7 @@ AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpu We use AISHELL-3 to train a multi-speaker fastspeech2 model here. ## Dataset -### Download and Extract the datasaet +### Download and Extract Download AISHELL-3. ```bash wget https://www.openslr.org/resources/93/data_aishell3.tgz ``` Extract AISHELL-3. ```bash mkdir data_aishell3 tar zxvf data_aishell3.tgz -C data_aishell3 ``` -### Get MFA result of AISHELL-3 and Extract it +### Get MFA Result and Extract We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. @@ -32,7 +32,12 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` + +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -58,7 +63,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -95,7 +100,7 @@ optional arguments: 5. `--phones-dict` is the path of the phone vocabulary file. 6. `--speaker-dict`is the path of the speaker id map file when training a multi-speaker FastSpeech2. -### Synthesize +### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
Download pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. ```bash diff --git a/examples/aishell3/tts3/run.sh b/examples/aishell3/tts3/run.sh index 95e4d38fe212f6bdbc9e8ab143671f8a427ecfc8..b375f215984e92ff8acd7ad5f91da67e16863716 100755 --- a/examples/aishell3/tts3/run.sh +++ b/examples/aishell3/tts3/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_482.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md index 376d4a3317d0b5e139f7c31e2efdc2fc5c0fafc3..fa5c6694196b07cab186601373030317bdc87f95 100644 --- a/examples/aishell3/vc0/README.md +++ b/examples/aishell3/vc0/README.md @@ -16,11 +16,15 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${input} ${preprocess_path} ${alignment} ${ge2e_ckpt_path} ``` -#### generate speaker embedding +#### Generate Speaker Embedding Use pretrained GE2E (speaker encoder) to generate speaker embedding for each sentence in AISHELL-3, which has the same file structure with wav files and the format is `.npy`. ```bash @@ -34,7 +38,7 @@ fi ``` The computing time of utterance embedding can be x hours. -#### process wav +#### Process Wav There are silence in the edge of AISHELL-3's wavs, and the audio amplitude is very small, so, we need to remove the silence and normalize the audio. You can the silence remove method based on volume or energy, but the effect is not very good, We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence. We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so the lexicon we provided to MFA is pinyin rather than Chinese characters. And the prosody marks(`$` and `%`) need to be removed. You shoud preprocess the dataset into the format which MFA needs, the texts have the same name with wavs and have the suffix `.lab`. @@ -53,7 +57,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then fi ``` -#### preprocess transcription +#### Preprocess Transcription We revert the transcription into `phones` and `tones`. It is worth noting that our processing here is different from that used for MFA, we separated the tones. This is a processing method, of course, you can only segment initials and vowels. ```bash @@ -64,7 +68,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then fi ``` The default input is `~/datasets/data_aishell3/train`,which contains `label_train-set.txt`, the processed results are `metadata.yaml` and `metadata.pickle`. the former is a text format for easy viewing, and the latter is a binary format for direct reading. 
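To check what this step produced, both result files can be inspected from the shell. The snippet below assumes they were written under `${preprocess_path}` (the output directory passed to `./local/preprocess.sh`); adjust the paths to match your setup.
```bash
# human-readable view of the processed transcriptions
head -n 30 ${preprocess_path}/metadata.yaml
# quick check that the binary copy loads (same content, faster to read programmatically)
python3 -c "import pickle, sys; print(type(pickle.load(open(sys.argv[1], 'rb'))))" ${preprocess_path}/metadata.pickle
```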
-#### extract mel +#### Extract Mel ```python if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then python3 ${BIN_DIR}/extract_mel.py \ @@ -73,7 +77,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then fi ``` -### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} ``` diff --git a/examples/aishell3/vc0/run.sh b/examples/aishell3/vc0/run.sh index 8d3da78135c829b1c95b521dd528c94ac7525a9b..870360c1c3372d58f75384f921504d8ff0f4e2df 100755 --- a/examples/aishell3/vc0/run.sh +++ b/examples/aishell3/vc0/run.sh @@ -23,7 +23,7 @@ waveflow_params_path=./waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams vc_input=ref_audio vc_output=syn_audio -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/aishell3/vc1/README.md b/examples/aishell3/vc1/README.md index ae53443efe0471f200468b8a404aa29d55915dc0..635cde896e8a06b3607b733af02c509955abfc53 100644 --- a/examples/aishell3/vc1/README.md +++ b/examples/aishell3/vc1/README.md @@ -5,7 +5,7 @@ This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2 3. Vocoder: We use [Parallel Wave GAN](http://arxiv.org/abs/1910.11480) as the neural Vocoder, refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download AISHELL-3. ```bash wget https://www.openslr.org/resources/93/data_aishell3.tgz @@ -15,11 +15,11 @@ Extract AISHELL-3. mkdir data_aishell3 tar zxvf data_aishell3.tgz -C data_aishell3 ``` -### Get MFA result of AISHELL-3 and Extract it +### Get MFA Result and Extract We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. -## Pretrained GE2E model +## Pretrained GE2E Model We use pretrained GE2E model to generate spwaker embedding for each sentence. Download pretrained GE2E model from here [ge2e_ckpt_0.3.zip](https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip), and `unzip` it. @@ -38,7 +38,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path} ``` @@ -75,14 +79,14 @@ Also there is a `metadata.jsonl` in each subfolder. It is a table-like file whic The preprocessing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ge2e/inference` step here. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. 
```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` The training step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`. -### Synthesize +### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. ```bash diff --git a/examples/aishell3/vc1/run.sh b/examples/aishell3/vc1/run.sh index 4eae1bdd87fadfe1663202cbd9700620e64cea37..64f4ee3bc89b2c14e0850cbe22b1bbd85962a4a3 100755 --- a/examples/aishell3/vc1/run.sh +++ b/examples/aishell3/vc1/run.sh @@ -18,7 +18,7 @@ ge2e_ckpt_path=./ge2e_ckpt_0.3/step-3000000 # include ".pdparams" here ge2e_params_path=${ge2e_ckpt_path}.pdparams -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md index bc28bba10f8773c4822c6153a2945804bd43700a..6ee6d39b11265e19248544c47bece40141753858 100644 --- a/examples/aishell3/voc1/README.md +++ b/examples/aishell3/voc1/README.md @@ -3,7 +3,7 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus which could be used to train multi-speaker Text-to-Speech (TTS) systems. ## Dataset -### Download and Extract the datasaet +### Download and Extract Download AISHELL-3. ```bash wget https://www.openslr.org/resources/93/data_aishell3.tgz @@ -13,7 +13,7 @@ Extract AISHELL-3. mkdir data_aishell3 tar zxvf data_aishell3.tgz -C data_aishell3 ``` -### Get MFA result of AISHELL-3 and Extract it +### Get MFA Result and Extract We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. @@ -29,7 +29,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -53,7 +57,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. -### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` @@ -100,7 +104,7 @@ benchmark: 3. `--output-dir` is the directory to save the results of the experiment. 
Checkpoints are save in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} diff --git a/examples/aishell3/voc1/run.sh b/examples/aishell3/voc1/run.sh index 7d0fdb21e32ac7d0a66cc9598884baa5954a5707..4f426ea02e11e5e24663985d3966ec0eaf67a740 100755 --- a/examples/aishell3/voc1/run.sh +++ b/examples/aishell3/voc1/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_5000.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md index 5ebf3cf4e0ce641be92f6c92ad78c38a4a2defd9..80e50fe93ab8e8f88704bfb8387a64725242e989 100644 --- a/examples/csmsc/tts2/README.md +++ b/examples/csmsc/tts2/README.md @@ -2,10 +2,10 @@ This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2008.03802) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). NOTE that we only implement the student part of the Speedyspeech model. The ground truth alignment used to train the model is extracted from the dataset using [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/source). -### Get MFA result of CSMSC and Extract it +### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. @@ -23,7 +23,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -47,7 +51,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, tones, durations, path of spectrogram, and id of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 @@ -88,7 +92,7 @@ optional arguments: 5. `--phones-dict` is the path of the phone vocabulary file. 6. `--tones-dict` is the path of the tone vocabulary file. -### Synthesize +### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder. 
Download pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) and unzip it. ```bash @@ -200,7 +204,7 @@ optional arguments: 7. `--phones-dict` is the path of the phone vocabulary file. 8. `--tones-dict` is the path of the tone vocabulary file. -### Inference +### Inferencing After Synthesize, we will get static models of speedyspeech and pwgan in `${train_output_path}/inference`. `./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for speedyspeech + pwgan synthesize. ```bash diff --git a/examples/csmsc/tts2/run.sh b/examples/csmsc/tts2/run.sh index 200e81929698e59222043953d5cc9a154a2d0a64..8b8f53bd0c7eab15581847505dca7e13577dfb97 100755 --- a/examples/csmsc/tts2/run.sh +++ b/examples/csmsc/tts2/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_76.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md index 104964c85aee16ba33bf5769adc4433c9b7675ef..c99690e1f73322920bc02803b053faa5e38e3a0c 100644 --- a/examples/csmsc/tts3/README.md +++ b/examples/csmsc/tts3/README.md @@ -2,10 +2,10 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/source). -### Get MFA result of CSMSC and Extract it +### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. @@ -23,7 +23,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -50,7 +54,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. -### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` @@ -86,7 +90,7 @@ optional arguments: 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. 
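If it helps to see these arguments together, a direct call to the training script looks roughly like the sketch below. This is only an illustration: the `dump/` paths and flag values are placeholders, and the exact flag set should be checked against `./local/train.sh` and `${BIN_DIR}/train.py --help`.
```bash
# roughly what ./local/train.sh wires up (paths are illustrative)
python3 ${BIN_DIR}/train.py \
    --config=conf/default.yaml \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --output-dir=exp/default \
    --ngpu=1 \
    --phones-dict=dump/phone_id_map.txt
```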
-### Synthesize +### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) and unzip it. ```bash @@ -191,7 +195,7 @@ optional arguments: 5. `--output-dir` is the directory to save synthesized audio files. 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Inference +### Inferencing After Synthesize, we will get static models of fastspeech2 and pwgan in `${train_output_path}/inference`. `./local/inference.sh` calls `${BIN_DIR}/inference.py`, which provides a paddle static model inference example for fastspeech2 + pwgan synthesize. ```bash diff --git a/examples/csmsc/tts3/run.sh b/examples/csmsc/tts3/run.sh index 718d6076041f8e8855bad5801383e691f175eee3..c1ddd3b98629c8b645222db6754bf031b97f3712 100755 --- a/examples/csmsc/tts3/run.sh +++ b/examples/csmsc/tts3/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_153.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md index 86114a42338a859a3f81ab43cfebc9403b9e5794..9d516be43ee91b527a763cbedd1aca82641219a1 100644 --- a/examples/csmsc/voc1/README.md +++ b/examples/csmsc/voc1/README.md @@ -1,11 +1,11 @@ # Parallel WaveGAN with CSMSC This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. -### Get MFA results for silence trim -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started @@ -20,7 +20,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -43,7 +47,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. 
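Each line of `metadata.jsonl` is a single JSON object, so the preprocessing output can be sanity-checked directly from the shell. The `dump/train/norm` path below is the layout these examples usually produce; adjust it if your dump directory differs.
```bash
# pretty-print the first record of the normalized training metadata
head -n 1 dump/train/norm/metadata.jsonl | python3 -m json.tool
# count how many utterances ended up in each split
wc -l dump/{train,dev,test}/norm/metadata.jsonl
```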
-### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` @@ -90,7 +94,7 @@ benchmark: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} diff --git a/examples/csmsc/voc1/run.sh b/examples/csmsc/voc1/run.sh index 16309543948c1a4de048e977639ddde86c4769b2..cab1ac38b1a3cb1c984ad1f570160d5852ce1697 100755 --- a/examples/csmsc/voc1/run.sh +++ b/examples/csmsc/voc1/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_5000.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/csmsc/voc3/README.md b/examples/csmsc/voc3/README.md index 4925b649d747f8d5d6124cf8eb20564a28f94f11..0a64d1a18477bc1d703cb691847d7631e2da5338 100644 --- a/examples/csmsc/voc3/README.md +++ b/examples/csmsc/voc3/README.md @@ -1,11 +1,11 @@ # Multi Band MelGAN with CSMSC This example contains code used to train a [Multi Band MelGAN](https://arxiv.org/abs/2005.05106) model with [Chinese Standard Mandarin Speech Copus](https://www.data-baker.com/open_source.html). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/BZNSYP`. -### Get MFA results for silence trim -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. ## Get Started @@ -20,7 +20,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -43,7 +47,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. -### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` @@ -75,7 +79,7 @@ optional arguments: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} @@ -106,7 +110,7 @@ optional arguments: 4. `--output-dir` is the directory to save the synthesized audio files. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -## Finetune +## Fine-tuning Since there are no `noise` in the input of Multi Band MelGAN, the audio quality is not so good (see [espnet issue](https://github.com/espnet/espnet/issues/3536#issuecomment-916035415)), we refer to the method proposed in [HiFiGAN](https://arxiv.org/abs/2010.05646), finetune Multi Band MelGAN with the predicted mel-spectrogram from `FastSpeech2`. The length of mel-spectrograms should align with the length of wavs, so we should generate mels using ground truth alignment. @@ -144,7 +148,7 @@ Run the command below By default, `finetune.sh` will use `conf/finetune.yaml` as config, the dump-dir is `dump_finetune`, the experiment dir is `exp/finetune`. TODO: -The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set). +The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set). ## Pretrained Models Pretrained model can be downloaded here [mb_melgan_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_ckpt_0.5.zip). diff --git a/examples/csmsc/voc3/run.sh b/examples/csmsc/voc3/run.sh index 360f6ec2a23379dc3477042c878617537e092081..3e7d7e2ab61882a7985c0de110803815071afd1a 100755 --- a/examples/csmsc/voc3/run.sh +++ b/examples/csmsc/voc3/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_50000.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/ljspeech/tts0/README.md b/examples/ljspeech/tts0/README.md index 305add2042b0b62a8cb3ca57e05230fdf5778d46..d49d6d6b60103b74b165f493895e2f306e14caec 100644 --- a/examples/ljspeech/tts0/README.md +++ b/examples/ljspeech/tts0/README.md @@ -1,4 +1,4 @@ -# Tacotron2 with LJSpeech +# Tacotron2 with LJSpeech PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884). ## Dataset @@ -18,11 +18,15 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. 
```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -51,7 +55,7 @@ By default, training will be resumed from the latest checkpoint in `--output`, i And if you want to resume from an other existing model, you should set `checkpoint_path` to be the checkpoint path you want to load. **Note: The checkpoint path cannot contain the file extension.** -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which synthesize **mels** from text_list here. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${train_output_path} ${ckpt_name} diff --git a/examples/ljspeech/tts0/run.sh b/examples/ljspeech/tts0/run.sh index 1da80c962516739b251785a8811337121e8f439f..47c76c3d2c8b766802b81ca0e159623b709eba84 100755 --- a/examples/ljspeech/tts0/run.sh +++ b/examples/ljspeech/tts0/run.sh @@ -11,7 +11,7 @@ preprocess_path=preprocessed_ljspeech train_output_path=output ckpt_name=step-35000 -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/ljspeech/tts1/README.md b/examples/ljspeech/tts1/README.md index 8a43ecd9c2b7685f4457295b2e4c0e4da066d246..c2d0c59e821c5ad9d3116f05370dcd44b0302961 100644 --- a/examples/ljspeech/tts1/README.md +++ b/examples/ljspeech/tts1/README.md @@ -1,11 +1,9 @@ # TransformerTTS with LJSpeech ## Dataset -### Download the datasaet +We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). + ```bash wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -``` -### Extract the dataset -```bash tar xjvf LJSpeech-1.1.tar.bz2 ``` ## Get Started @@ -20,7 +18,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -44,7 +46,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, path of speech features, speaker and id of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -77,7 +79,7 @@ optional arguments: 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. -## Synthesize +## Synthesizing We use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder. Download Pretrained WaveFlow Model with residual channel equals 128 from [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip) and unzip it. 
```bash diff --git a/examples/ljspeech/tts1/run.sh b/examples/ljspeech/tts1/run.sh index 6e7a60607190c438d023b20d794cbf7317acdab7..48c4c915114beb361106ea3c9986566dc7f7c19e 100755 --- a/examples/ljspeech/tts1/run.sh +++ b/examples/ljspeech/tts1/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_403.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index 5bdaf4b82621e548553d05754a4950d2745d6096..bb5c7a69ed070c1882f0dbee5dc90f57cf619555 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -2,10 +2,10 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). -### Get MFA result of LJSpeech-1.1 and Extract it +### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. @@ -22,7 +22,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -49,7 +53,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -85,7 +89,7 @@ optional arguments: 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 5. `--phones-dict` is the path of the phone vocabulary file. -### Synthesize +### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) and unzip it. 
```bash diff --git a/examples/ljspeech/tts3/run.sh b/examples/ljspeech/tts3/run.sh index 143debd2a4c05fe825f739faf9a8496c8a686c9b..c64fa8883220db1b019d56056fe7c06033176573 100755 --- a/examples/ljspeech/tts3/run.sh +++ b/examples/ljspeech/tts3/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_201.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/ljspeech/voc0/README.md b/examples/ljspeech/voc0/README.md index 0d4e6c51a00dc1e91238545382972ad977f5d718..725eb617bfd0e88fb9b7628e255a8e217d409607 100644 --- a/examples/ljspeech/voc0/README.md +++ b/examples/ljspeech/voc0/README.md @@ -1,11 +1,9 @@ # WaveFlow with LJSpeech ## Dataset -### Download the datasaet. +We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/). + ```bash wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 -``` -### Extract the dataset. -```bash tar xjvf LJSpeech-1.1.tar.bz2 ``` ## Get Started @@ -19,11 +17,15 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset. +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${preprocess_path} ``` -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} @@ -35,7 +37,7 @@ The training script requires 4 command line arguments. If you want distributed training, set a larger `--ngpu` (e.g. 4). Note that distributed training with cpu is not supported yet. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from mels. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_output_path} ${ckpt_name} diff --git a/examples/ljspeech/voc0/run.sh b/examples/ljspeech/voc0/run.sh index a4f1ac389cf244cc68609e809ff6713b22f0f06b..ddd82cb44936431212aafb7a3b7fbca47eac0860 100755 --- a/examples/ljspeech/voc0/run.sh +++ b/examples/ljspeech/voc0/run.sh @@ -13,7 +13,7 @@ train_output_path=output input_mel_path=../tts0/output/test ckpt_name=step-10000 -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index 24f6dbcafe7086230ea6c05c9a8a7624f3cf142a..7cb69b154713c787c48036030f97bd7d4ce089bc 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -1,10 +1,10 @@ # Parallel WaveGAN with the LJSpeech-1.1 This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/). 
## Dataset -### Download and Extract the datasaet +### Download and Extract Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). -### Get MFA results for silence trim -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started @@ -19,8 +19,11 @@ Run the command below to ```bash ./run.sh ``` - -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -44,7 +47,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -91,7 +94,7 @@ benchmark: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} @@ -122,7 +125,7 @@ optional arguments: 4. `--output-dir` is the directory to save the synthesized audio files. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -## Pretrained Models +## Pretrained Model Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) Parallel WaveGAN checkpoint contains files listed below. diff --git a/examples/ljspeech/voc1/run.sh b/examples/ljspeech/voc1/run.sh index 16309543948c1a4de048e977639ddde86c4769b2..cab1ac38b1a3cb1c984ad1f570160d5852ce1697 100755 --- a/examples/ljspeech/voc1/run.sh +++ b/examples/ljspeech/voc1/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_5000.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... 
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/other/ge2e/README.md b/examples/other/ge2e/README.md index d58ca5137b6f39d2b38a5c9cb73872dcd9e5f7ff..2b3f91b52b1b9e03526def1f89b6444d4b58c2bf 100644 --- a/examples/other/ge2e/README.md +++ b/examples/other/ge2e/README.md @@ -24,8 +24,11 @@ If you want to use other datasets, you can also download and preprocess it as lo ```bash ./run.sh ``` - -### Preprocess Datasets +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing `./local/preprocess.sh` calls `${BIN_DIR}/preprocess.py`. ```bash ./local/preprocess.sh ${datasets_root} ${preprocess_path} ${dataset_names} @@ -62,7 +65,7 @@ In `${BIN_DIR}/preprocess.py`: 2. `--output_dir` is the directory to save the preprocessed dataset 3. `--dataset_names` is the dataset to preprocess. If there are multiple datasets in `--datasets_root` to preprocess, the names can be joined with comma. Currently supported dataset names are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh and magicdata. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} @@ -79,7 +82,7 @@ Other options are described below. - `--opts` is command line options to further override config files. It should be the last comman line options passed with multiple key-value pairs separated by spaces. - `--checkpoint_path` specifies the checkpoiont to load before training, extension is not included. A parameter file ( `.pdparams`) and an optimizer state file ( `.pdopt`) with the same name is used. This option has a higher priority than auto-resuming from the `--output` directory. -### Inference +### Inferencing When training is done, run the command below to generate utterance embedding for each utterance in a dataset. `./local/inference.sh` calls `${BIN_DIR}/inference.py`. ```bash diff --git a/examples/other/ge2e/run.sh b/examples/other/ge2e/run.sh index d7954bd2fbfb69e8ad6196c61b43b34ac3c49e1a..e69b34a7c032c31886ce5e1565bb8084bec23eb3 100755 --- a/examples/other/ge2e/run.sh +++ b/examples/other/ge2e/run.sh @@ -15,7 +15,7 @@ infer_input=infer_input infer_output=infer_output ckpt_name=step-10000 -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index 894d6b1476243736c0b540c285ae72824147f573..aab005735134dc5514cc9ee18edc30240ebaf1b4 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -5,7 +5,7 @@ This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2 ### Download and Extract the datasaet Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443). -### Get MFA result of VCTK and Extract it +### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. 
You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): @@ -25,7 +25,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -52,7 +56,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and` test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker and id of each utterance. -### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` @@ -87,7 +91,7 @@ optional arguments: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 4. `--phones-dict` is the path of the phone vocabulary file. -### Synthesize +### Synthesizing We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)and unzip it. diff --git a/examples/vctk/tts3/run.sh b/examples/vctk/tts3/run.sh index 0562ef3f408d64cad4339243c55a470d6d0d29a5..a2b849bc8999bc72f5b6c12d79e44ef2d63005d9 100755 --- a/examples/vctk/tts3/run.sh +++ b/examples/vctk/tts3/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_331.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index 8692f0104358455a7c10617c1f08a8822d861d34..154fd7cdee108e776cb1df7312a0a5757509f44e 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -2,11 +2,11 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [VCTK](https://datashare.ed.ac.uk/handle/10283/3443). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`. -### Get MFA results for silence trim -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. 
You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. @@ -24,7 +24,11 @@ Run the command below to ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -48,7 +52,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. -### Train the model +### Model Training ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` @@ -95,7 +99,7 @@ benchmark: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} @@ -126,7 +130,7 @@ optional arguments: 4. `--output-dir` is the directory to save the synthesized audio files. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -## Pretrained Models +## Pretrained Model Pretrained models can be downloaded here [pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip). Parallel WaveGAN checkpoint contains files listed below. diff --git a/examples/vctk/voc1/run.sh b/examples/vctk/voc1/run.sh index 7d0fdb21e32ac7d0a66cc9598884baa5954a5707..4f426ea02e11e5e24663985d3966ec0eaf67a740 100755 --- a/examples/vctk/voc1/run.sh +++ b/examples/vctk/voc1/run.sh @@ -11,7 +11,7 @@ conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_5000.pdz -# with the following command, you can choice the stage range you want to run +# with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` # this can not be mixed use with `$1`, `$2` ... source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
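For reference, the stage selection mentioned in the comments above follows the same pattern in every `run.sh` of these examples: `parse_options.sh` overwrites the `stage` and `stop_stage` variables from the `--stage`/`--stop-stage` flags, and each step is wrapped in a range check. A minimal sketch (stage numbers, variable defaults and the `path.sh` setup are illustrative):
```bash
#!/bin/bash
set -e
source path.sh               # sets MAIN_ROOT and BIN_DIR in the real examples

gpus=0,1
stage=0
stop_stage=100
conf_path=conf/default.yaml
train_output_path=exp/default

# `--stage 0 --stop-stage 0` on the command line rewrites the two variables above
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # stage 0: data preprocessing only
    ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # stage 1: model training
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
```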