Commit 3751845d, authored by Yang Zhou

Merge branch 'develop' of github.com:SmileGoat/PaddleSpeech into refactor_file_struct

Conflicts:
	speechx/speechx/codelab/CMakeLists.txt
...@@ -19,7 +19,7 @@ import subprocess ...@@ -19,7 +19,7 @@ import subprocess
import platform import platform
COPYRIGHT = ''' COPYRIGHT = '''
Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License"); Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License. you may not use this file except in compliance with the License.
......
...@@ -24,6 +24,8 @@ ...@@ -24,6 +24,8 @@
| <a href="#documents"> Documents </a> | <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a> | <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a> | <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
| <a href="https://arxiv.org/abs/2205.12007"> Paper </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee </a>
</h4> </h4>
</div> </div>
......
...@@ -25,6 +25,8 @@ ...@@ -25,6 +25,8 @@
| <a href="#教程文档"> 教程文档 </a> | <a href="#教程文档"> 教程文档 </a>
| <a href="#模型列表"> 模型列表 </a> | <a href="#模型列表"> 模型列表 </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio 课程 </a> | <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio 课程 </a>
| <a href="https://arxiv.org/abs/2205.12007"> 论文 </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee
</h4> </h4>
</div> </div>
......
...@@ -2,14 +2,14 @@ ...@@ -2,14 +2,14 @@
([简体中文](./README_cn.md)|English) ([简体中文](./README_cn.md)|English)
The directory containes many speech applications in multi scenarios. This directory contains many speech applications in multiple scenarios.
* audio searching - mass audio similarity retrieval * audio searching - mass audio similarity retrieval
* audio tagging - multi-label tagging of an audio file * audio tagging - multi-label tagging of an audio file
* automatic_video_subtitiles - generate subtitles from a video * automatic_video_subtitles - generate subtitles from a video
* metaverse - 2D AR with TTS * metaverse - 2D AR with TTS
* punctuation_restoration - restore punctuation from raw text * punctuation_restoration - restore punctuation from raw text
* speech recogintion - recognize text of an audio file * speech recognition - recognize text of an audio file
* speech server - Server for Speech Task, e.g. ASR,TTS,CLS * speech server - Server for Speech Task, e.g. ASR,TTS,CLS
* streaming asr server - receive audio stream from websocket, and recognize to transcript. * streaming asr server - receive audio stream from websocket, and recognize to transcript.
* speech translation - end to end speech translation * speech translation - end to end speech translation
......
...@@ -14,7 +14,7 @@ ...@@ -14,7 +14,7 @@
import numpy as np import numpy as np
from logs import LOGGER from logs import LOGGER
from paddlespeech.cli import VectorExecutor from paddlespeech.cli.vector import VectorExecutor
vector_executor = VectorExecutor() vector_executor = VectorExecutor()
......
...@@ -57,7 +57,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe ...@@ -57,7 +57,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import CLSExecutor from paddlespeech.cli.cls import CLSExecutor
cls_executor = CLSExecutor() cls_executor = CLSExecutor()
result = cls_executor( result = cls_executor(
......
...@@ -57,7 +57,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe ...@@ -57,7 +57,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import CLSExecutor from paddlespeech.cli.cls import CLSExecutor
cls_executor = CLSExecutor() cls_executor = CLSExecutor()
result = cls_executor( result = cls_executor(
......
...@@ -28,7 +28,8 @@ ffmpeg -i subtitle_demo1.mp4 -ac 1 -ar 16000 -vn input.wav ...@@ -28,7 +28,8 @@ ffmpeg -i subtitle_demo1.mp4 -ac 1 -ar 16000 -vn input.wav
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import ASRExecutor, TextExecutor from paddlespeech.cli.asr import ASRExecutor
from paddlespeech.cli.text import TextExecutor
asr_executor = ASRExecutor() asr_executor = ASRExecutor()
text_executor = TextExecutor() text_executor = TextExecutor()
......
...@@ -23,7 +23,8 @@ ffmpeg -i subtitle_demo1.mp4 -ac 1 -ar 16000 -vn input.wav ...@@ -23,7 +23,8 @@ ffmpeg -i subtitle_demo1.mp4 -ac 1 -ar 16000 -vn input.wav
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import ASRExecutor, TextExecutor from paddlespeech.cli.asr import ASRExecutor
from paddlespeech.cli.text import TextExecutor
asr_executor = ASRExecutor() asr_executor = ASRExecutor()
text_executor = TextExecutor() text_executor = TextExecutor()
......
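The hunks in this commit replace `from paddlespeech.cli import ...` with one import per submodule. A minimal sketch of that rewrite follows. The class-to-submodule mapping is taken only from the hunks shown in this diff, and `rewrite_cli_import` is a hypothetical helper name, not part of PaddleSpeech. It assumes each input line has no trailing newline.

```python
import re
from typing import List

# Executor class -> submodule, as it appears in the hunks of this commit.
EXECUTOR_MODULES = {
    "ASRExecutor": "asr",
    "CLSExecutor": "cls",
    "STExecutor": "st",
    "TextExecutor": "text",
    "TTSExecutor": "tts",
    "VectorExecutor": "vector",
}

# Matches only the old form; "from paddlespeech.cli.asr import ..." does not match.
_OLD_IMPORT = re.compile(r"^(\s*)from paddlespeech\.cli import (.+?)\s*$")


def rewrite_cli_import(line: str) -> List[str]:
    """Hypothetical helper: split an old-style import into per-submodule imports."""
    match = _OLD_IMPORT.match(line)
    if match is None:
        return [line]
    indent, names = match.groups()
    rewritten = []
    for name in (part.strip() for part in names.split(",")):
        module = EXECUTOR_MODULES.get(name)
        if module is None:
            # Unknown name: leave the whole line untouched rather than guess.
            return [line]
        rewritten.append(f"{indent}from paddlespeech.cli.{module} import {name}")
    return rewritten


for new_line in rewrite_cli_import(
        "from paddlespeech.cli import ASRExecutor, TextExecutor"):
    print(new_line)
```

Lines that already use the new form, and lines that import a name not in the mapping, are returned unchanged, so running the helper twice should be harmless.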
...@@ -16,8 +16,8 @@ import os ...@@ -16,8 +16,8 @@ import os
import paddle import paddle
from paddlespeech.cli import ASRExecutor from paddlespeech.cli.asr import ASRExecutor
from paddlespeech.cli import TextExecutor from paddlespeech.cli.text import TextExecutor
# yapf: disable # yapf: disable
parser = argparse.ArgumentParser(__doc__) parser = argparse.ArgumentParser(__doc__)
......
...@@ -42,7 +42,7 @@ The input of this demo should be a text of the specific language that can be pas ...@@ -42,7 +42,7 @@ The input of this demo should be a text of the specific language that can be pas
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import TextExecutor from paddlespeech.cli.text import TextExecutor
text_executor = TextExecutor() text_executor = TextExecutor()
result = text_executor( result = text_executor(
......
...@@ -44,7 +44,7 @@ ...@@ -44,7 +44,7 @@
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import TextExecutor from paddlespeech.cli.text import TextExecutor
text_executor = TextExecutor() text_executor = TextExecutor()
result = text_executor( result = text_executor(
......
...@@ -96,7 +96,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav ...@@ -96,7 +96,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
- Python API - Python API
```python ```python
from paddlespeech.cli import VectorExecutor from paddlespeech.cli.vector import VectorExecutor
vector_executor = VectorExecutor() vector_executor = VectorExecutor()
audio_emb = vector_executor( audio_emb = vector_executor(
......
...@@ -95,7 +95,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav ...@@ -95,7 +95,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import VectorExecutor from paddlespeech.cli.vector import VectorExecutor
vector_executor = VectorExecutor() vector_executor = VectorExecutor()
audio_emb = vector_executor( audio_emb = vector_executor(
......
...@@ -58,7 +58,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ...@@ -58,7 +58,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import ASRExecutor from paddlespeech.cli.asr import ASRExecutor
asr_executor = ASRExecutor() asr_executor = ASRExecutor()
text = asr_executor( text = asr_executor(
......
...@@ -56,7 +56,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ...@@ -56,7 +56,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import ASRExecutor from paddlespeech.cli.asr import ASRExecutor
asr_executor = ASRExecutor() asr_executor = ASRExecutor()
text = asr_executor( text = asr_executor(
......
...@@ -47,7 +47,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ...@@ -47,7 +47,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import STExecutor from paddlespeech.cli.st import STExecutor
st_executor = STExecutor() st_executor = STExecutor()
text = st_executor( text = st_executor(
......
...@@ -47,7 +47,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ...@@ -47,7 +47,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import STExecutor from paddlespeech.cli.st import STExecutor
st_executor = STExecutor() st_executor = STExecutor()
text = st_executor( text = st_executor(
......
...@@ -77,7 +77,7 @@ The input of this demo should be a text of the specific language that can be pas ...@@ -77,7 +77,7 @@ The input of this demo should be a text of the specific language that can be pas
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import TTSExecutor from paddlespeech.cli.tts import TTSExecutor
tts_executor = TTSExecutor() tts_executor = TTSExecutor()
wav_file = tts_executor( wav_file = tts_executor(
......
...@@ -80,7 +80,7 @@ ...@@ -80,7 +80,7 @@
- Python API - Python API
```python ```python
import paddle import paddle
from paddlespeech.cli import TTSExecutor from paddlespeech.cli.tts import TTSExecutor
tts_executor = TTSExecutor() tts_executor = TTSExecutor()
wav_file = tts_executor( wav_file = tts_executor(
......
...@@ -113,12 +113,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -113,12 +113,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -127,11 +127,10 @@ Synthesize with acoustic model & vocoder ...@@ -127,11 +127,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -143,10 +142,10 @@ optional arguments: ...@@ -143,10 +142,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -162,12 +161,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -162,12 +161,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -177,11 +176,10 @@ Synthesize with acoustic model & vocoder ...@@ -177,11 +176,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -192,10 +190,10 @@ optional arguments: ...@@ -192,10 +190,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -208,9 +206,9 @@ optional arguments: ...@@ -208,9 +206,9 @@ optional arguments:
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat`, `--phones_dict` `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict` `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
......
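The `{model_name}_{dataset}` format used by `--am` and `--voc` can be illustrated with a small sketch. It assumes the dataset name never contains an underscore, which holds for every choice listed in the help text above (`csmsc`, `ljspeech`, `aishell3`, `vctk`). `split_model_tag` is a hypothetical name and not a function from the repository.

```python
from typing import Tuple


def split_model_tag(tag: str) -> Tuple[str, str]:
    """Split a tag such as 'mb_melgan_csmsc' into (model_name, dataset).

    Assumption: the dataset is the part after the last underscore.
    """
    model_name, sep, dataset = tag.rpartition("_")
    if not sep or not model_name or not dataset:
        raise ValueError(f"expected {{model_name}}_{{dataset}}, got {tag!r}")
    return model_name, dataset


print(split_model_tag("fastspeech2_csmsc"))
print(split_model_tag("mb_melgan_csmsc"))
```

Splitting on the last underscore rather than the first keeps multi-word model names such as `mb_melgan` and `style_melgan` intact.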
...@@ -68,7 +68,7 @@ Train a ParallelWaveGAN model. ...@@ -68,7 +68,7 @@ Train a ParallelWaveGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG ParallelWaveGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -59,15 +59,13 @@ Here's the complete help message. ...@@ -59,15 +59,13 @@ Here's the complete help message.
```text ```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR] [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER] [--ngpu NGPU]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model. Train a HiFiGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG HiFiGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
...@@ -75,19 +73,6 @@ optional arguments: ...@@ -75,19 +73,6 @@ optional arguments:
--output-dir OUTPUT_DIR --output-dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu. --ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
......
...@@ -103,12 +103,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -103,12 +103,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -117,11 +117,10 @@ Synthesize with acoustic model & vocoder ...@@ -117,11 +117,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -133,10 +132,10 @@ optional arguments: ...@@ -133,10 +132,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -152,12 +151,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -152,12 +151,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -167,11 +166,10 @@ Synthesize with acoustic model & vocoder ...@@ -167,11 +166,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -182,10 +180,10 @@ optional arguments: ...@@ -182,10 +180,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -198,9 +196,9 @@ optional arguments: ...@@ -198,9 +196,9 @@ optional arguments:
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the Tacotron2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the Tacotron2 pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
......
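The numbered notes in this hunk rename `--am_checkpoint` and `--voc_checkpoint` to `--am_ckpt` and `--voc_ckpt`, matching the usage text. The sketch below mirrors a subset of the flags from that help text with `argparse` to show why the old spelling would not be recognized. It is an illustration, not the actual `synthesize_e2e.py`: `build_parser` is a hypothetical name, the `choices` lists are shortened, and the checkpoint file names are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a subset of the flags in the help text."""
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # Shortened choices lists; the real help text lists more values.
    parser.add_argument("--am", choices=["tacotron2_csmsc", "tacotron2_ljspeech"])
    parser.add_argument("--am_config")
    parser.add_argument("--am_ckpt")
    parser.add_argument("--am_stat")
    parser.add_argument("--phones_dict")
    parser.add_argument("--voc", choices=["pwgan_csmsc", "hifigan_csmsc"])
    parser.add_argument("--voc_config")
    parser.add_argument("--voc_ckpt")
    parser.add_argument("--voc_stat")
    parser.add_argument("--lang")
    return parser


parser = build_parser()

# New spelling: parsed into the namespace, nothing left over.
args, unknown = parser.parse_known_args(
    ["--am", "tacotron2_csmsc", "--am_ckpt", "am.pdz", "--voc_ckpt", "voc.pdz"])
print(args.am_ckpt, args.voc_ckpt, unknown)

# Old spelling: not a prefix of any defined flag, so it is reported as unknown.
_, unknown = parser.parse_known_args(["--am_checkpoint", "am.pdz"])
print(unknown)
```

`parse_known_args` is used here only so the unrecognized flag can be inspected; with `parse_args` the same input would exit with an error.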
...@@ -109,12 +109,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -109,12 +109,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -123,11 +123,10 @@ Synthesize with acoustic model & vocoder ...@@ -123,11 +123,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -139,10 +138,10 @@ optional arguments: ...@@ -139,10 +138,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -158,12 +157,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -158,12 +157,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -173,11 +172,10 @@ Synthesize with acoustic model & vocoder ...@@ -173,11 +172,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -188,10 +186,10 @@ optional arguments: ...@@ -188,10 +186,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -204,9 +202,9 @@ optional arguments: ...@@ -204,9 +202,9 @@ optional arguments:
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat`, `--phones_dict` and `--tones_dict` are arguments for acoustic model, which correspond to the 5 files in the speedyspeech pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict` and `--tones_dict` are arguments for acoustic model, which correspond to the 5 files in the speedyspeech pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
......
...@@ -111,12 +111,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -111,12 +111,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -125,11 +125,10 @@ Synthesize with acoustic model & vocoder ...@@ -125,11 +125,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -141,10 +140,10 @@ optional arguments: ...@@ -141,10 +140,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -160,12 +159,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -160,12 +159,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -175,11 +174,10 @@ Synthesize with acoustic model & vocoder ...@@ -175,11 +174,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -190,10 +188,10 @@ optional arguments: ...@@ -190,10 +188,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -204,11 +202,12 @@ optional arguments: ...@@ -204,11 +202,12 @@ optional arguments:
--text TEXT text to synthesize, a 'utt_id sentence' pair per line. --text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output_dir OUTPUT_DIR --output_dir OUTPUT_DIR
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
......
...@@ -117,12 +117,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -117,12 +117,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -131,11 +131,10 @@ Synthesize with acoustic model & vocoder ...@@ -131,11 +131,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -147,10 +146,10 @@ optional arguments: ...@@ -147,10 +146,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -167,12 +166,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -167,12 +166,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -182,11 +181,10 @@ Synthesize with acoustic model & vocoder ...@@ -182,11 +181,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -197,10 +195,10 @@ optional arguments: ...@@ -197,10 +195,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -213,9 +211,9 @@ optional arguments: ...@@ -213,9 +211,9 @@ optional arguments:
output dir. output dir.
``` ```
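1. `--am` is the acoustic model type, in the format {model_name}_{dataset}. 1. `--am` is the acoustic model type, in the format {model_name}_{dataset}.
2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type, in the format {model_name}_{dataset}. 3. `--voc` is the vocoder type, in the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the normalized metadata file under `test` in the `dump` folder. 6. `--test_metadata` should be the normalized metadata file under `test` in the `dump` folder.
7. `--text` is the text file, which contains the sentences to synthesize. 7. `--text` is the text file, which contains the sentences to synthesize.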
......
# VITS with CSMSC
This example contains code used to train a [VITS](https://arxiv.org/abs/2106.06103) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
## Dataset
### Download and Extract
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for VITS; the durations produced by MFA are not needed here.
You can download the alignment result from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
- synthesize waveform from a text file.
```bash
./run.sh
```
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command only preprocesses the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
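The stage range can be pictured with a small Python sketch. It is only an illustration of inclusive range selection, not the contents of `run.sh`: the text above confirms only that stage 0 is preprocessing, so the numbers assigned to the later stages and the name `stages_to_run` are assumptions.

```python
from typing import List

# Hypothetical stage table. Only stage 0 (preprocess) is confirmed by the
# README; the numbering of the later stages is assumed for illustration.
STAGES = {
    0: "preprocess",
    1: "train",
    2: "synthesize",
    3: "synthesize_e2e",
}


def stages_to_run(stage: int, stop_stage: int) -> List[str]:
    """Return the stage names whose number lies in [stage, stop_stage]."""
    return [
        name
        for number, name in sorted(STAGES.items())
        if stage <= number <= stop_stage
    ]


if __name__ == "__main__":
    # Equal bounds select exactly one stage.
    print(stages_to_run(0, 0))
    # A wider range selects every stage between the bounds.
    print(stages_to_run(1, 3))
```

Both bounds are treated as inclusive here, which is why setting `stage` equal to `stop-stage` selects a single stage.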
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── feats_stats.npy
    ├── norm
    └── raw
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the wave and linear spectrogram of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and are located in `dump/train/feats_stats.npy`.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, feats, feats_lengths, the path of linear spectrogram features, the path of raw waves, speaker, and the id of each utterance.
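As a rough illustration of how such a table-like file can be inspected, the sketch below reads a JSON Lines file (one JSON object per line) with the standard library only. The helper names `load_metadata` and `index_by`, the key `utt_id`, and the demo records are assumptions made for this example; check an actual `metadata.jsonl` for the real field names.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_metadata(path) -> List[Dict]:
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def index_by(records: List[Dict], key: str = "utt_id") -> Dict[str, Dict]:
    """Build a lookup table keyed by one field ("utt_id" is an assumed name)."""
    return {rec[key]: rec for rec in records}


if __name__ == "__main__":
    # Tiny stand-in file with made-up records; a real run would point at
    # something like dump/train/norm/metadata.jsonl instead.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "metadata.jsonl"
        demo.write_text(
            '{"utt_id": "000001", "text_lengths": 3}\n'
            '{"utt_id": "000002", "text_lengths": 5}\n',
            encoding="utf-8",
        )
        table = index_by(load_metadata(demo))
        print(sorted(table))
```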
### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]

Train a VITS model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file to overwrite default config.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
5. `--phones-dict` is the path of the phone vocabulary file.
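The options above can be mirrored with a small `argparse` sketch. It is reconstructed from the help message only and is not the source of `train.py`: the argument types, the default values, and the helper names `build_parser` and `pick_device` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented options; types and defaults are assumed."""
    parser = argparse.ArgumentParser(
        prog="train.py", description="Train a VITS model.")
    parser.add_argument(
        "--config", type=str, help="config file to overwrite default config.")
    parser.add_argument("--train-metadata", type=str, help="training data.")
    parser.add_argument("--dev-metadata", type=str, help="dev data.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    return parser


def pick_device(ngpu: int) -> str:
    """Map the documented rule 'if ngpu == 0, use cpu' to a device name."""
    return "cpu" if ngpu == 0 else "gpu"


# Parse an explicit list so the sketch runs without real command-line input.
args = build_parser().parse_args([
    "--config", "conf/default.yaml",
    "--train-metadata", "dump/train/norm/metadata.jsonl",
    "--dev-metadata", "dump/dev/norm/metadata.jsonl",
    "--output-dir", "exp/default",
    "--ngpu", "0",
    "--phones-dict", "dump/phone_id_map.txt",
])

# argparse turns dashes in option names into underscores in attribute names.
print(args.train_metadata, pick_device(args.ngpu))
```

The `exp/default` output directory is a placeholder chosen for the example.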
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--config CONFIG] [--ckpt CKPT]
                     [--phones_dict PHONES_DICT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]

Synthesize with VITS

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Config of VITS.
  --ckpt CKPT           Checkpoint file of VITS.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
```
`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e.py`, which can synthesize waveform from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize_e2e.py [-h] [--config CONFIG] [--ckpt CKPT]
                         [--phones_dict PHONES_DICT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]

Synthesize with VITS

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Config of VITS.
  --ckpt CKPT           Checkpoint file of VITS.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
```
1. `--config`, `--ckpt`, and `--phones_dict` are arguments for the VITS model, which correspond to the 3 files in the VITS pretrained model.
2. `--lang` is the model language, which can be `zh` or `en`.
3. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
4. `--text` is the text file, which contains sentences to synthesize.
5. `--output_dir` is the directory to save synthesized audio files.
6. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
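The help message describes the `--text` file as one `utt_id sentence` pair per line. A minimal sketch of parsing that layout follows; it assumes the id and the sentence are separated by the first run of whitespace, which the README does not state explicitly, and the function name `parse_sentences` is made up for this example.

```python
from typing import Iterable, List, Tuple


def parse_sentences(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each non-empty line into (utt_id, sentence) at the first whitespace."""
    pairs = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'utt_id sentence'")
        pairs.append((parts[0], parts[1]))
    return pairs


if __name__ == "__main__":
    demo = [
        "001 hello world",
        "",
        "002 another sentence to synthesize",
    ]
    for utt_id, sentence in parse_sentences(demo):
        print(utt_id, "->", sentence)
```

Splitting only once keeps any spaces inside the sentence intact, which matters for English input.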
## Pretrained Model
...@@ -65,7 +65,7 @@ Train a ParallelWaveGAN model. ...@@ -65,7 +65,7 @@ Train a ParallelWaveGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG ParallelWaveGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -63,7 +63,7 @@ Train a Multi-Band MelGAN model. ...@@ -63,7 +63,7 @@ Train a Multi-Band MelGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG Multi-Band MelGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -63,7 +63,7 @@ Train a Style MelGAN model. ...@@ -63,7 +63,7 @@ Train a Style MelGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG Style MelGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -63,7 +63,7 @@ Train a HiFiGAN model. ...@@ -63,7 +63,7 @@ Train a HiFiGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG HiFiGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -63,7 +63,7 @@ Train a WaveRNN model. ...@@ -63,7 +63,7 @@ Train a WaveRNN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG WaveRNN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -103,12 +103,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -103,12 +103,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -117,11 +117,10 @@ Synthesize with acoustic model & vocoder ...@@ -117,11 +117,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -133,10 +132,10 @@ optional arguments: ...@@ -133,10 +132,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -152,12 +151,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -152,12 +151,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -167,11 +166,10 @@ Synthesize with acoustic model & vocoder ...@@ -167,11 +166,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -182,10 +180,10 @@ optional arguments: ...@@ -182,10 +180,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -198,9 +196,9 @@ optional arguments: ...@@ -198,9 +196,9 @@ optional arguments:
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the Tacotron2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the Tacotron2 pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
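
The flag conventions listed above can be illustrated with a small `argparse` sketch. This is not the actual `synthesize_e2e.py` parser. The flag names come from the help text, but the `choices` lists are a shortened subset, the restriction of `--lang` to `zh`/`en` is inferred from item 5, the file name is a hypothetical placeholder, and `split_model_tag` is an illustrative helper showing one way the `{model_name}_{dataset}` format could be taken apart.

```python
import argparse
from typing import Tuple


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the help text; choices are a shortened subset.
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    parser.add_argument(
        "--am",
        choices=["fastspeech2_csmsc", "tacotron2_csmsc", "tacotron2_ljspeech"],
        default="tacotron2_csmsc",
        help="Acoustic model type, {model_name}_{dataset}.")
    parser.add_argument("--am_config", type=str, default=None)
    parser.add_argument("--am_ckpt", type=str, default=None)
    parser.add_argument("--am_stat", type=str, default=None)
    parser.add_argument("--phones_dict", type=str, default=None)
    parser.add_argument(
        "--voc",
        choices=["pwgan_csmsc", "mb_melgan_csmsc", "hifigan_csmsc"],
        default="pwgan_csmsc",
        help="Vocoder type, {model_name}_{dataset}.")
    parser.add_argument("--voc_config", type=str, default=None)
    parser.add_argument("--voc_ckpt", type=str, default=None)
    parser.add_argument("--voc_stat", type=str, default=None)
    # Assumption: item 5 names only `zh` and `en`, so the sketch enforces that.
    parser.add_argument("--lang", choices=["zh", "en"], default="zh")
    return parser


def split_model_tag(tag: str) -> Tuple[str, str]:
    """Split '{model_name}_{dataset}' on the last underscore (hypothetical helper)."""
    name, dataset = tag.rsplit("_", 1)
    return name, dataset


args = build_parser().parse_args([
    "--am", "tacotron2_csmsc",
    "--am_ckpt", "am_checkpoint.pdz",  # hypothetical file name
    "--voc", "mb_melgan_csmsc",
    "--lang", "zh",
])
am_name, am_dataset = split_model_tag(args.am)
voc_name, voc_dataset = split_model_tag(args.voc)
```

Splitting on the last underscore matters because some model names, such as `mb_melgan`, contain an underscore themselves.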
......
...@@ -58,7 +58,7 @@ Train a TransformerTTS model with LJSpeech TTS dataset. ...@@ -58,7 +58,7 @@ Train a TransformerTTS model with LJSpeech TTS dataset.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG TransformerTTS config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -109,12 +109,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -109,12 +109,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -123,11 +123,10 @@ Synthesize with acoustic model & vocoder ...@@ -123,11 +123,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -139,10 +138,10 @@ optional arguments: ...@@ -139,10 +138,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -158,12 +157,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -158,12 +157,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -173,11 +172,10 @@ Synthesize with acoustic model & vocoder ...@@ -173,11 +172,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -188,10 +186,10 @@ optional arguments: ...@@ -188,10 +186,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -204,9 +202,9 @@ optional arguments: ...@@ -204,9 +202,9 @@ optional arguments:
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
......
...@@ -65,7 +65,7 @@ Train a ParallelWaveGAN model. ...@@ -65,7 +65,7 @@ Train a ParallelWaveGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG ParallelWaveGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -57,15 +57,13 @@ Here's the complete help message. ...@@ -57,15 +57,13 @@ Here's the complete help message.
```text ```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR] [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER] [--ngpu NGPU]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model. Train a HiFiGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG HiFiGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
...@@ -73,19 +71,6 @@ optional arguments: ...@@ -73,19 +71,6 @@ optional arguments:
--output-dir OUTPUT_DIR --output-dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu. --ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
......
...@@ -112,12 +112,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p ...@@ -112,12 +112,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_p
``` ```
```text ```text
usage: synthesize.py [-h] usage: synthesize.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
[--voice-cloning VOICE_CLONING] [--voice-cloning VOICE_CLONING]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--ngpu NGPU] [--voc_stat VOC_STAT] [--ngpu NGPU]
[--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
...@@ -126,11 +126,10 @@ Synthesize with acoustic model & vocoder ...@@ -126,11 +126,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -142,10 +141,10 @@ optional arguments: ...@@ -142,10 +141,10 @@ optional arguments:
speaker id map file. speaker id map file.
--voice-cloning VOICE_CLONING --voice-cloning VOICE_CLONING
whether training voice cloning model. whether training voice cloning model.
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -161,12 +160,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp ...@@ -161,12 +160,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_outp
``` ```
```text ```text
usage: synthesize_e2e.py [-h] usage: synthesize_e2e.py [-h]
[--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk}] [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}]
[--am_config AM_CONFIG] [--am_ckpt AM_CKPT] [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
[--am_stat AM_STAT] [--phones_dict PHONES_DICT] [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
[--tones_dict TONES_DICT] [--tones_dict TONES_DICT]
[--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
[--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}] [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}]
[--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
[--voc_stat VOC_STAT] [--lang LANG] [--voc_stat VOC_STAT] [--lang LANG]
[--inference_dir INFERENCE_DIR] [--ngpu NGPU] [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
...@@ -176,11 +175,10 @@ Synthesize with acoustic model & vocoder ...@@ -176,11 +175,10 @@ Synthesize with acoustic model & vocoder
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk} --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech}
Choose acoustic model type of tts task. Choose acoustic model type of tts task.
--am_config AM_CONFIG --am_config AM_CONFIG
Config of acoustic model. Use deault config when it is Config of acoustic model.
None.
--am_ckpt AM_CKPT Checkpoint file of acoustic model. --am_ckpt AM_CKPT Checkpoint file of acoustic model.
--am_stat AM_STAT mean and standard deviation used to normalize --am_stat AM_STAT mean and standard deviation used to normalize
spectrogram when training acoustic model. spectrogram when training acoustic model.
...@@ -191,10 +189,10 @@ optional arguments: ...@@ -191,10 +189,10 @@ optional arguments:
--speaker_dict SPEAKER_DICT --speaker_dict SPEAKER_DICT
speaker id map file. speaker id map file.
--spk_id SPK_ID spk id for multi speaker acoustic model --spk_id SPK_ID spk id for multi speaker acoustic model
--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc} --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}
Choose vocoder type of tts task. Choose vocoder type of tts task.
--voc_config VOC_CONFIG --voc_config VOC_CONFIG
Config of voc. Use deault config when it is None. Config of voc.
--voc_ckpt VOC_CKPT Checkpoint file of voc. --voc_ckpt VOC_CKPT Checkpoint file of voc.
--voc_stat VOC_STAT mean and standard deviation used to normalize --voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc. spectrogram when training voc.
...@@ -207,9 +205,9 @@ optional arguments: ...@@ -207,9 +205,9 @@ optional arguments:
output dir. output dir.
``` ```
1. `--am` is acoustic model type with the format {model_name}_{dataset} 1. `--am` is acoustic model type with the format {model_name}_{dataset}
2. `--am_config`, `--am_checkpoint`, `--am_stat`, `--phones_dict` and `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model. 2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict` and `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model.
3. `--voc` is vocoder type with the format {model_name}_{dataset} 3. `--voc` is vocoder type with the format {model_name}_{dataset}
4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`. 5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize. 7. `--text` is the text file, which contains sentences to synthesize.
......
...@@ -70,7 +70,7 @@ Train a ParallelWaveGAN model. ...@@ -70,7 +70,7 @@ Train a ParallelWaveGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG ParallelWaveGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
......
...@@ -62,15 +62,13 @@ Here's the complete help message. ...@@ -62,15 +62,13 @@ Here's the complete help message.
```text ```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR] [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
[--ngpu NGPU] [--batch-size BATCH_SIZE] [--max-iter MAX_ITER] [--ngpu NGPU]
[--run-benchmark RUN_BENCHMARK]
[--profiler_options PROFILER_OPTIONS]
Train a ParallelWaveGAN model. Train a HiFiGAN model.
optional arguments: optional arguments:
-h, --help show this help message and exit -h, --help show this help message and exit
--config CONFIG config file to overwrite default config. --config CONFIG HiFiGAN config file.
--train-metadata TRAIN_METADATA --train-metadata TRAIN_METADATA
training data. training data.
--dev-metadata DEV_METADATA --dev-metadata DEV_METADATA
...@@ -78,19 +76,6 @@ optional arguments: ...@@ -78,19 +76,6 @@ optional arguments:
--output-dir OUTPUT_DIR --output-dir OUTPUT_DIR
output dir. output dir.
--ngpu NGPU if ngpu == 0, use cpu. --ngpu NGPU if ngpu == 0, use cpu.
benchmark:
arguments related to benchmark.
--batch-size BATCH_SIZE
batch size.
--max-iter MAX_ITER train max steps.
--run-benchmark RUN_BENCHMARK
runing benchmark or not, if True, use the --batch-size
and --max-iter.
--profiler_options PROFILER_OPTIONS
The option of profiler, which should be in format
"key1=value1;key2=value2;key3=value3".
``` ```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
......
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright 2021 Xiaomi Corporation (Author: Yongqing Wang) # Copyright 2021 Xiaomi Corporation (Author: Yongqing Wang)
# Mobvoi Inc(Author: Di Wu, Binbin Zhang) # Mobvoi Inc(Author: Di Wu, Binbin Zhang)
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # You may obtain a copy of the License at
...@@ -24,6 +13,7 @@ ...@@ -24,6 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import argparse import argparse
import json import json
import os import os
......
...@@ -13,14 +13,7 @@ ...@@ -13,14 +13,7 @@
# limitations under the License. # limitations under the License.
import _locale import _locale
from .asr import ASRExecutor
from .base_commands import BaseCommand from .base_commands import BaseCommand
from .base_commands import HelpCommand from .base_commands import HelpCommand
from .cls import CLSExecutor
from .st import STExecutor
from .stats import StatsExecutor
from .text import TextExecutor
from .tts import TTSExecutor
from .vector import VectorExecutor
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8']) _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])
...@@ -29,7 +29,6 @@ from yacs.config import CfgNode ...@@ -29,7 +29,6 @@ from yacs.config import CfgNode
from ..download import get_path_from_url from ..download import get_path_from_url
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import cli_register
from ..utils import CLI_TIMER from ..utils import CLI_TIMER
from ..utils import MODEL_HOME from ..utils import MODEL_HOME
from ..utils import stats_wrapper from ..utils import stats_wrapper
...@@ -45,8 +44,6 @@ __all__ = ['ASRExecutor'] ...@@ -45,8 +44,6 @@ __all__ = ['ASRExecutor']
@timer_register @timer_register
@cli_register(
name='paddlespeech.asr', description='Speech to text infer command.')
class ASRExecutor(BaseExecutor): class ASRExecutor(BaseExecutor):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
......
...@@ -15,6 +15,7 @@ from typing import List ...@@ -15,6 +15,7 @@ from typing import List
from .entry import commands from .entry import commands
from .utils import cli_register from .utils import cli_register
from .utils import explicit_command_register
from .utils import get_command from .utils import get_command
__all__ = [ __all__ = [
...@@ -73,3 +74,20 @@ class VersionCommand: ...@@ -73,3 +74,20 @@ class VersionCommand:
print(msg) print(msg)
return True return True
# Dynamic import when running specific command
_commands = {
'asr': ['Speech to text infer command.', 'ASRExecutor'],
'cls': ['Audio classification infer command.', 'CLSExecutor'],
'st': ['Speech translation infer command.', 'STExecutor'],
'text': ['Text command.', 'TextExecutor'],
'tts': ['Text to Speech infer command.', 'TTSExecutor'],
'vector': ['Speech to vector embedding infer command.', 'VectorExecutor'],
}
for com, info in _commands.items():
explicit_command_register(
name='paddlespeech.{}'.format(com),
description=info[0],
cls='paddlespeech.cli.{}.{}'.format(com, info[1]))
...@@ -27,7 +27,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import ...@@ -27,7 +27,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import cli_register
from ..utils import stats_wrapper from ..utils import stats_wrapper
from .pretrained_models import model_alias from .pretrained_models import model_alias
from .pretrained_models import pretrained_models from .pretrained_models import pretrained_models
...@@ -36,8 +35,6 @@ from .pretrained_models import pretrained_models ...@@ -36,8 +35,6 @@ from .pretrained_models import pretrained_models
__all__ = ['CLSExecutor'] __all__ = ['CLSExecutor']
@cli_register(
name='paddlespeech.cls', description='Audio classification infer command.')
class CLSExecutor(BaseExecutor): class CLSExecutor(BaseExecutor):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
...@@ -246,4 +243,4 @@ class CLSExecutor(BaseExecutor): ...@@ -246,4 +243,4 @@ class CLSExecutor(BaseExecutor):
self.infer() self.infer()
res = self.postprocess(topk) # Retrieve result of cls. res = self.postprocess(topk) # Retrieve result of cls.
return res return res
\ No newline at end of file
...@@ -34,6 +34,11 @@ def _execute(): ...@@ -34,6 +34,11 @@ def _execute():
# The method 'execute' of a command instance returns 'True' for a success # The method 'execute' of a command instance returns 'True' for a success
# while 'False' for a failure. This result is converted into an exit status # while 'False' for a failure. This result is converted into an exit status
# in bash: 0 for a success and 1 for a failure. # in bash: 0 for a success and 1 for a failure.
if not callable(com['_entry']):
i = com['_entry'].rindex('.')
module, cls = com['_entry'][:i], com['_entry'][i + 1:]
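# Resolve the dotted path without exec(); reading an exec()-created name back
# through locals() is not reliable inside a function (it fails on Python 3.13+).
from importlib import import_module
com['_entry'] = getattr(import_module(module), cls)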
status = 0 if com['_entry']().execute(sys.argv[idx:]) else 1 status = 0 if com['_entry']().execute(sys.argv[idx:]) else 1
return status return status
......
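The lazy lookup in `_execute()` above can be sketched in isolation. This is a minimal illustration, not the project's code: `resolve_entry` is a hypothetical helper name, and a standard-library class stands in for an executor so the snippet runs without PaddleSpeech installed.

```python
import importlib


def resolve_entry(entry):
    """Return a callable for `entry`, importing it only when it is a dotted path."""
    if callable(entry):
        # Already a class or function (e.g. registered through a decorator).
        return entry
    # Split "package.module.ClassName" into the module path and the attribute name.
    module_name, _, attr = entry.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# A standard-library class stands in for something like an executor class.
ordered_dict_cls = resolve_entry('collections.OrderedDict')
print(ordered_dict_cls.__name__)
```

Because the import happens only when a command is actually run, starting the CLI does not need to import every executor module up front.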
...@@ -28,7 +28,6 @@ from yacs.config import CfgNode ...@@ -28,7 +28,6 @@ from yacs.config import CfgNode
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress from ..utils import download_and_decompress
from ..utils import MODEL_HOME from ..utils import MODEL_HOME
from ..utils import stats_wrapper from ..utils import stats_wrapper
...@@ -42,8 +41,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import ...@@ -42,8 +41,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ["STExecutor"] __all__ = ["STExecutor"]
@cli_register(
name="paddlespeech.st", description="Speech translation infer command.")
class STExecutor(BaseExecutor): class STExecutor(BaseExecutor):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
......
...@@ -23,7 +23,6 @@ import paddle ...@@ -23,7 +23,6 @@ import paddle
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import cli_register
from ..utils import stats_wrapper from ..utils import stats_wrapper
from .pretrained_models import model_alias from .pretrained_models import model_alias
from .pretrained_models import pretrained_models from .pretrained_models import pretrained_models
...@@ -33,7 +32,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import ...@@ -33,7 +32,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ['TextExecutor'] __all__ = ['TextExecutor']
@cli_register(name='paddlespeech.text', description='Text infer command.')
class TextExecutor(BaseExecutor): class TextExecutor(BaseExecutor):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
......
...@@ -28,7 +28,6 @@ from yacs.config import CfgNode ...@@ -28,7 +28,6 @@ from yacs.config import CfgNode
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import cli_register
from ..utils import stats_wrapper from ..utils import stats_wrapper
from .pretrained_models import model_alias from .pretrained_models import model_alias
from .pretrained_models import pretrained_models from .pretrained_models import pretrained_models
...@@ -40,8 +39,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import ...@@ -40,8 +39,6 @@ from paddlespeech.utils.dynamic_import import dynamic_import
__all__ = ['TTSExecutor'] __all__ = ['TTSExecutor']
@cli_register(
name='paddlespeech.tts', description='Text to Speech infer command.')
class TTSExecutor(BaseExecutor): class TTSExecutor(BaseExecutor):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
......
...@@ -41,6 +41,7 @@ requests.adapters.DEFAULT_RETRIES = 3 ...@@ -41,6 +41,7 @@ requests.adapters.DEFAULT_RETRIES = 3
__all__ = [ __all__ = [
'timer_register', 'timer_register',
'cli_register', 'cli_register',
'explicit_command_register',
'get_command', 'get_command',
'download_and_decompress', 'download_and_decompress',
'load_state_dict_from_url', 'load_state_dict_from_url',
...@@ -70,6 +71,16 @@ def cli_register(name: str, description: str='') -> Any: ...@@ -70,6 +71,16 @@ def cli_register(name: str, description: str='') -> Any:
return _warpper return _warpper
def explicit_command_register(name: str, description: str='', cls: str=''):
items = name.split('.')
com = commands
for item in items:
com = com[item]
com['_entry'] = cls
if description:
com['_description'] = description
def get_command(name: str) -> Any: def get_command(name: str) -> Any:
items = name.split('.') items = name.split('.')
com = commands com = commands
......
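The registration step shown above can be sketched on its own. The snippet assumes that `commands` behaves like a nested dictionary that creates missing levels on access; the `_tree` helper is hypothetical and only stands in for whatever `.entry` actually provides.

```python
from collections import defaultdict


def _tree():
    """Hypothetical stand-in: a dict that creates nested dicts on demand."""
    return defaultdict(_tree)


commands = _tree()  # assumption: mimics the registry imported from `.entry`


def explicit_command_register(name, description='', cls=''):
    # Walk "paddlespeech.asr" one level at a time, creating nodes as needed.
    node = commands
    for item in name.split('.'):
        node = node[item]
    # Store the dotted class path as a string; it is imported later, on use.
    node['_entry'] = cls
    if description:
        node['_description'] = description


explicit_command_register(
    name='paddlespeech.asr',
    description='Speech to text infer command.',
    cls='paddlespeech.cli.asr.ASRExecutor')
```

Storing a string instead of the class is what lets the package `__init__` drop its eager executor imports.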
...@@ -28,7 +28,6 @@ from yacs.config import CfgNode ...@@ -28,7 +28,6 @@ from yacs.config import CfgNode
from ..executor import BaseExecutor from ..executor import BaseExecutor
from ..log import logger from ..log import logger
from ..utils import cli_register
from ..utils import stats_wrapper from ..utils import stats_wrapper
from .pretrained_models import model_alias from .pretrained_models import model_alias
from .pretrained_models import pretrained_models from .pretrained_models import pretrained_models
...@@ -37,9 +36,6 @@ from paddlespeech.vector.io.batch import feature_normalize ...@@ -37,9 +36,6 @@ from paddlespeech.vector.io.batch import feature_normalize
from paddlespeech.vector.modules.sid_model import SpeakerIdetification from paddlespeech.vector.modules.sid_model import SpeakerIdetification
@cli_register(
name="paddlespeech.vector",
description="Speech to vector embedding infer command.")
class VectorExecutor(BaseExecutor): class VectorExecutor(BaseExecutor):
def __init__(self): def __init__(self):
super().__init__() super().__init__()
...@@ -476,4 +472,4 @@ class VectorExecutor(BaseExecutor): ...@@ -476,4 +472,4 @@ class VectorExecutor(BaseExecutor):
else: else:
logger.info("The audio file format is right") logger.info("The audio file format is right")
return True return True
\ No newline at end of file
# Copyright (c) 2021 Binbin Zhang(binbzha@qq.com)
# 2022 Shaoqing Yu(954793264@qq.com)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
......
# Copyright (c) 2021 Binbin Zhang(binbzha@qq.com)
# Menglong Xu
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
......
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2021 Binbin Zhang(binbzha@qq.com)
# 2022 Shaoqing Yu(954793264@qq.com)
# 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
......
# Copyright (c) 2021 Binbin Zhang
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
......
# Copyright (c) 2021 Jingyong Hou (houjingyong@gmail.com)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
......
...@@ -189,25 +189,6 @@ if not hasattr(paddle.Tensor, 'contiguous'): ...@@ -189,25 +189,6 @@ if not hasattr(paddle.Tensor, 'contiguous'):
paddle.static.Variable.contiguous = contiguous paddle.static.Variable.contiguous = contiguous
def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
nargs = len(args)
assert (nargs <= 1)
s = paddle.shape(xs)
if nargs == 1:
return s[args[0]]
else:
return s
#`to_static` do not process `size` property, maybe some `paddle` api dependent on it.
logger.debug(
"override size of paddle.Tensor "
"(`to_static` do not process `size` property, maybe some `paddle` api dependent on it), remove this when fixed!"
)
paddle.Tensor.size = size
paddle.static.Variable.size = size
def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor: def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
return xs.reshape(args) return xs.reshape(args)
...@@ -219,7 +200,7 @@ if not hasattr(paddle.Tensor, 'view'): ...@@ -219,7 +200,7 @@ if not hasattr(paddle.Tensor, 'view'):
def view_as(xs: paddle.Tensor, ys: paddle.Tensor) -> paddle.Tensor: def view_as(xs: paddle.Tensor, ys: paddle.Tensor) -> paddle.Tensor:
return xs.reshape(ys.size()) return xs.reshape(paddle.shape(ys))
if not hasattr(paddle.Tensor, 'view_as'): if not hasattr(paddle.Tensor, 'view_as'):
......
...@@ -194,7 +194,7 @@ class BeamSearch(paddle.nn.Layer): ...@@ -194,7 +194,7 @@ class BeamSearch(paddle.nn.Layer):
Args: Args:
hyp (Hypothesis): Hypothesis with prefix tokens to score hyp (Hypothesis): Hypothesis with prefix tokens to score
ids (paddle.Tensor): 1D tensor of new partial tokens to score, ids (paddle.Tensor): 1D tensor of new partial tokens to score,
len(ids) < n_vocab len(ids) < n_vocab
x (paddle.Tensor): Corresponding input feature, (T, D) x (paddle.Tensor): Corresponding input feature, (T, D)
...@@ -224,14 +224,14 @@ class BeamSearch(paddle.nn.Layer): ...@@ -224,14 +224,14 @@ class BeamSearch(paddle.nn.Layer):
ids (paddle.Tensor): The partial token ids(Global) to compute topk. ids (paddle.Tensor): The partial token ids(Global) to compute topk.
Returns: Returns:
Tuple[paddle.Tensor, paddle.Tensor]: Tuple[paddle.Tensor, paddle.Tensor]:
The topk full token ids and partial token ids. The topk full token ids and partial token ids.
Their shapes are `(self.beam_size,)`. Their shapes are `(self.beam_size,)`.
i.e. (global ids, global relative local ids). i.e. (global ids, global relative local ids).
""" """
# no pre beam performed, `ids` equal to `weighted_scores` # no pre beam performed, `ids` equal to `weighted_scores`
if weighted_scores.size(0) == ids.size(0): if paddle.shape(weighted_scores)[0] == paddle.shape(ids)[0]:
top_ids = weighted_scores.topk( top_ids = weighted_scores.topk(
self.beam_size)[1] # index in n_vocab self.beam_size)[1] # index in n_vocab
return top_ids, top_ids return top_ids, top_ids
...@@ -370,13 +370,13 @@ class BeamSearch(paddle.nn.Layer): ...@@ -370,13 +370,13 @@ class BeamSearch(paddle.nn.Layer):
""" """
# set length bounds # set length bounds
if maxlenratio == 0: if maxlenratio == 0:
maxlen = x.shape[0] maxlen = paddle.shape(x)[0]
elif maxlenratio < 0: elif maxlenratio < 0:
maxlen = -1 * int(maxlenratio) maxlen = -1 * int(maxlenratio)
else: else:
maxlen = max(1, int(maxlenratio * x.size(0))) maxlen = max(1, int(maxlenratio * paddle.shape(x)[0]))
minlen = int(minlenratio * x.size(0)) minlen = int(minlenratio * paddle.shape(x)[0])
logger.info("decoder input length: " + str(x.shape[0])) logger.info("decoder input length: " + str(paddle.shape(x)[0]))
logger.info("max output length: " + str(maxlen)) logger.info("max output length: " + str(maxlen))
logger.info("min output length: " + str(minlen)) logger.info("min output length: " + str(minlen))
......
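The length-bound rules in the hunk above can be restated with plain integers. This is a sketch only: `length_bounds` is a hypothetical helper, and `input_len` stands in for the first dimension of `x`.

```python
def length_bounds(input_len, maxlenratio, minlenratio):
    """Return (maxlen, minlen) following the three maxlenratio cases above."""
    if maxlenratio == 0:
        # A ratio of 0 means "use the input length as the maximum".
        maxlen = input_len
    elif maxlenratio < 0:
        # A negative ratio is read as an absolute maximum length.
        maxlen = -1 * int(maxlenratio)
    else:
        # A positive ratio scales the input length, with a floor of 1.
        maxlen = max(1, int(maxlenratio * input_len))
    minlen = int(minlenratio * input_len)
    return maxlen, minlen


print(length_bounds(100, 0.5, 0.25))
```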
...@@ -69,7 +69,7 @@ class CTCPrefixScorer(BatchPartialScorerInterface): ...@@ -69,7 +69,7 @@ class CTCPrefixScorer(BatchPartialScorerInterface):
return sc[i], st[i] return sc[i], st[i]
else: # for CTCPrefixScorePD (need new_id > 0) else: # for CTCPrefixScorePD (need new_id > 0)
r, log_psi, f_min, f_max, scoring_idmap = state r, log_psi, f_min, f_max, scoring_idmap = state
s = log_psi[i, new_id].expand(log_psi.size(1)) s = log_psi[i, new_id].expand(paddle.shape(log_psi)[1])
if scoring_idmap is not None: if scoring_idmap is not None:
return r[:, :, i, scoring_idmap[i, new_id]], s, f_min, f_max return r[:, :, i, scoring_idmap[i, new_id]], s, f_min, f_max
else: else:
...@@ -107,7 +107,7 @@ class CTCPrefixScorer(BatchPartialScorerInterface): ...@@ -107,7 +107,7 @@ class CTCPrefixScorer(BatchPartialScorerInterface):
""" """
logp = self.ctc.log_softmax(x.unsqueeze(0)) # assuming batch_size = 1 logp = self.ctc.log_softmax(x.unsqueeze(0)) # assuming batch_size = 1
xlen = paddle.to_tensor([logp.size(1)]) xlen = paddle.to_tensor([paddle.shape(logp)[1]])
self.impl = CTCPrefixScorePD(logp, xlen, 0, self.eos) self.impl = CTCPrefixScorePD(logp, xlen, 0, self.eos)
return None return None
......
...@@ -33,9 +33,9 @@ class CTCPrefixScorePD(): ...@@ -33,9 +33,9 @@ class CTCPrefixScorePD():
self.logzero = -10000000000.0 self.logzero = -10000000000.0
self.blank = blank self.blank = blank
self.eos = eos self.eos = eos
self.batch = x.size(0) self.batch = paddle.shape(x)[0]
self.input_length = x.size(1) self.input_length = paddle.shape(x)[1]
self.odim = x.size(2) self.odim = paddle.shape(x)[2]
self.dtype = x.dtype self.dtype = x.dtype
# Pad the rest of posteriors in the batch # Pad the rest of posteriors in the batch
...@@ -76,8 +76,7 @@ class CTCPrefixScorePD(): ...@@ -76,8 +76,7 @@ class CTCPrefixScorePD():
last_ids = [yi[-1] for yi in y] # last output label ids last_ids = [yi[-1] for yi in y] # last output label ids
n_bh = len(last_ids) # batch * hyps n_bh = len(last_ids) # batch * hyps
n_hyps = n_bh // self.batch # assuming each utterance has the same # of hyps n_hyps = n_bh // self.batch # assuming each utterance has the same # of hyps
self.scoring_num = scoring_ids.size( self.scoring_num = paddle.shape(scoring_ids)[-1] if scoring_ids is not None else 0
-1) if scoring_ids is not None else 0
# prepare state info # prepare state info
if state is None: if state is None:
r_prev = paddle.full( r_prev = paddle.full(
...@@ -153,7 +152,7 @@ class CTCPrefixScorePD(): ...@@ -153,7 +152,7 @@ class CTCPrefixScorePD():
# compute forward probabilities log(r_t^n(h)) and log(r_t^b(h)) # compute forward probabilities log(r_t^n(h)) and log(r_t^b(h))
for t in range(start, end): for t in range(start, end):
rp = r[t - 1] # (2 x BW x O') rp = r[t - 1] # (2 x BW x O')
rr = paddle.stack([rp[0], log_phi[t - 1], rp[0], rp[1]]).view( rr = paddle.stack([rp[0], log_phi[t - 1], rp[0], rp[1]]).view(
2, 2, n_bh, snum) # (2,2,BW,O') 2, 2, n_bh, snum) # (2,2,BW,O')
r[t] = paddle.logsumexp(rr, 1) + x_[:, t] r[t] = paddle.logsumexp(rr, 1) + x_[:, t]
...@@ -227,7 +226,7 @@ class CTCPrefixScorePD(): ...@@ -227,7 +226,7 @@ class CTCPrefixScorePD():
if self.x.shape[1] < x.shape[1]: # self.x (2,T,B,O); x (B,T,O) if self.x.shape[1] < x.shape[1]: # self.x (2,T,B,O); x (B,T,O)
# Pad the rest of posteriors in the batch # Pad the rest of posteriors in the batch
# TODO(takaaki-hori): need a better way without for-loops # TODO(takaaki-hori): need a better way without for-loops
xlens = [x.size(1)] xlens = [paddle.shape(x)[1]]
for i, l in enumerate(xlens): for i, l in enumerate(xlens):
if l < self.input_length: if l < self.input_length:
x[i, l:, :] = self.logzero x[i, l:, :] = self.logzero
...@@ -237,7 +236,7 @@ class CTCPrefixScorePD(): ...@@ -237,7 +236,7 @@ class CTCPrefixScorePD():
xb = xn[:, :, self.blank].unsqueeze(2).expand(-1, -1, self.odim) xb = xn[:, :, self.blank].unsqueeze(2).expand(-1, -1, self.odim)
self.x = paddle.stack([xn, xb]) # (2, T, B, O) self.x = paddle.stack([xn, xb]) # (2, T, B, O)
self.x[:, :tmp_x.shape[1], :, :] = tmp_x self.x[:, :tmp_x.shape[1], :, :] = tmp_x
self.input_length = x.size(1) self.input_length = paddle.shape(x)[1]
self.end_frames = paddle.to_tensor(xlens) - 1 self.end_frames = paddle.to_tensor(xlens) - 1
def extend_state(self, state): def extend_state(self, state):
...@@ -318,16 +317,16 @@ class CTCPrefixScore(): ...@@ -318,16 +317,16 @@ class CTCPrefixScore():
r[0, 0] = xs[0] r[0, 0] = xs[0]
r[0, 1] = self.logzero r[0, 1] = self.logzero
else: else:
# Although the code does not exactly follow Algorithm 2, # Although the code does not exactly follow Algorithm 2,
# we don't have to change it because we can assume # we don't have to change it because we can assume
# r_t(h)=0 for t < |h| in CTC forward computation # r_t(h)=0 for t < |h| in CTC forward computation
# (Note: we assume here that index t starts with 0). # (Note: we assume here that index t starts with 0).
# The purpose of this difference is to reduce the number of for-loops. # The purpose of this difference is to reduce the number of for-loops.
# https://github.com/espnet/espnet/pull/3655 # https://github.com/espnet/espnet/pull/3655
# where we start to accumulate r_t(h) from t=|h| # where we start to accumulate r_t(h) from t=|h|
# and iterate r_t(h) = (r_{t-1}(h) + ...) to T-1, # and iterate r_t(h) = (r_{t-1}(h) + ...) to T-1,
# avoiding accumulating zeros for t=1~|h|-1. # avoiding accumulating zeros for t=1~|h|-1.
# Thus, we need to set r_{|h|-1}(h) = 0, # Thus, we need to set r_{|h|-1}(h) = 0,
# i.e., r[output_length-1] = logzero, for initialization. # i.e., r[output_length-1] = logzero, for initialization.
# This is just for reducing the computation. # This is just for reducing the computation.
r[output_length - 1] = self.logzero r[output_length - 1] = self.logzero
......
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# Copyright 2021 Mobvoi Inc. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. # you may not use this file except in compliance with the License.
......
...@@ -90,7 +90,7 @@ class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface): ...@@ -90,7 +90,7 @@ class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface):
def _target_mask(self, ys_in_pad): def _target_mask(self, ys_in_pad):
ys_mask = ys_in_pad != 0 ys_mask = ys_in_pad != 0
m = subsequent_mask(ys_mask.size(-1)).unsqueeze(0) m = subsequent_mask(paddle.shape(ys_mask)[-1]).unsqueeze(0)

return ys_mask.unsqueeze(-2) & m return ys_mask.unsqueeze(-2) & m
def forward(self, x: paddle.Tensor, t: paddle.Tensor def forward(self, x: paddle.Tensor, t: paddle.Tensor
...@@ -112,7 +112,7 @@ class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface): ...@@ -112,7 +112,7 @@ class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface):
in perplexity: p(t)^{-n} = exp(-log p(t) / n) in perplexity: p(t)^{-n} = exp(-log p(t) / n)
""" """
batch_size = x.size(0) batch_size = paddle.shape(x)[0]
xm = x != 0 xm = x != 0
xlen = xm.sum(axis=1) xlen = xm.sum(axis=1)
if self.embed_drop is not None: if self.embed_drop is not None:
...@@ -122,7 +122,7 @@ class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface): ...@@ -122,7 +122,7 @@ class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface):
h, _ = self.encoder(emb, xlen) h, _ = self.encoder(emb, xlen)
y = self.decoder(h) y = self.decoder(h)
loss = F.cross_entropy( loss = F.cross_entropy(
y.view(-1, y.shape[-1]), t.view(-1), reduction="none") y.view(-1, paddle.shape(y)[-1]), t.view(-1), reduction="none")
mask = xm.to(loss.dtype) mask = xm.to(loss.dtype)
logp = loss * mask.view(-1) logp = loss * mask.view(-1)
nll = logp.view(batch_size, -1).sum(-1) nll = logp.view(batch_size, -1).sum(-1)
......
# Copyright 2021 Mobvoi Inc. All Rights Reserved.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
...@@ -775,7 +776,7 @@ class U2DecodeModel(U2BaseModel): ...@@ -775,7 +776,7 @@ class U2DecodeModel(U2BaseModel):
""" """
self.eval() self.eval()
x = paddle.to_tensor(x).unsqueeze(0) x = paddle.to_tensor(x).unsqueeze(0)
ilen = x.size(1) ilen = paddle.shape(x)[1]
enc_output, _ = self._forward_encoder(x, ilen) enc_output, _ = self._forward_encoder(x, ilen)
return enc_output.squeeze(0) return enc_output.squeeze(0)
......
...@@ -11,7 +11,7 @@ ...@@ -11,7 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
# Modified from wenet(https://github.com/wenet-e2e/wenet)
from contextlib import nullcontext from contextlib import nullcontext
import paddle import paddle
......
...@@ -242,7 +242,7 @@ class TransformerDecoder(BatchScorerInterface, nn.Layer): ...@@ -242,7 +242,7 @@ class TransformerDecoder(BatchScorerInterface, nn.Layer):
] ]
# batch decoding # batch decoding
ys_mask = subsequent_mask(ys.size(-1)).unsqueeze(0) # (B,L,L) ys_mask = subsequent_mask(paddle.shape(ys)[-1]).unsqueeze(0) # (B,L,L)
xs_mask = make_xs_mask(xs).unsqueeze(1) # (B,1,T) xs_mask = make_xs_mask(xs).unsqueeze(1) # (B,1,T)
logp, states = self.forward_one_step( logp, states = self.forward_one_step(
xs, xs_mask, ys, ys_mask, cache=batch_state) xs, xs_mask, ys, ys_mask, cache=batch_state)
......
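The `subsequent_mask` helper is not part of this diff. As a rough illustration, and assuming it follows the usual causal (lower-triangular) convention, its result for a length `L` could be pictured with nested lists instead of tensors:

```python
def subsequent_mask_lists(size):
    """Hypothetical list-based sketch: entry [i][j] is True when j <= i."""
    return [[j <= i for j in range(size)] for i in range(size)]


# Each row i may attend to positions 0..i only.
for row in subsequent_mask_lists(3):
    print(row)
```

Under this assumption the mask depends only on the target length, which is why the call site only needs the last dimension of `ys`.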
...@@ -115,7 +115,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface): ...@@ -115,7 +115,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface):
assert offset + x.shape[ assert offset + x.shape[
1] < self.max_len, "offset: {} + x.shape[1]: {} is larger than the max_len: {}".format( 1] < self.max_len, "offset: {} + x.shape[1]: {} is larger than the max_len: {}".format(
offset, x.shape[1], self.max_len) offset, x.shape[1], self.max_len)
#TODO(Hui Zhang): using T = x.size(1), __getitem__ not support Tensor #TODO(Hui Zhang): using T = paddle.shape(x)[1], __getitem__ not support Tensor
pos_emb = self.pe[:, offset:offset + T] pos_emb = self.pe[:, offset:offset + T]
x = x * self.xscale + pos_emb x = x * self.xscale + pos_emb
return self.dropout(x), self.dropout(pos_emb) return self.dropout(x), self.dropout(pos_emb)
...@@ -165,6 +165,6 @@ class RelPositionalEncoding(PositionalEncoding): ...@@ -165,6 +165,6 @@ class RelPositionalEncoding(PositionalEncoding):
1] < self.max_len, "offset: {} + x.shape[1]: {} is larger than the max_len: {}".format( 1] < self.max_len, "offset: {} + x.shape[1]: {} is larger than the max_len: {}".format(
offset, x.shape[1], self.max_len) offset, x.shape[1], self.max_len)
x = x * self.xscale x = x * self.xscale
#TODO(Hui Zhang): using x.size(1), __getitem__ not support Tensor #TODO(Hui Zhang): using paddle.shape(x)[1], __getitem__ not support Tensor
pos_emb = self.pe[:, offset:offset + x.shape[1]] pos_emb = self.pe[:, offset:offset + x.shape[1]]
return self.dropout(x), self.dropout(pos_emb) return self.dropout(x), self.dropout(pos_emb)
...@@ -218,7 +218,7 @@ class BaseEncoder(nn.Layer): ...@@ -218,7 +218,7 @@ class BaseEncoder(nn.Layer):
assert xs.shape[0] == 1 # batch size must be one assert xs.shape[0] == 1 # batch size must be one
# tmp_masks is just for interface compatibility # tmp_masks is just for interface compatibility
# TODO(Hui Zhang): stride_slice not support bool tensor # TODO(Hui Zhang): stride_slice not support bool tensor
# tmp_masks = paddle.ones([1, xs.size(1)], dtype=paddle.bool) # tmp_masks = paddle.ones([1, paddle.shape(xs)[1]], dtype=paddle.bool)
tmp_masks = paddle.ones([1, xs.shape[1]], dtype=paddle.int32) tmp_masks = paddle.ones([1, xs.shape[1]], dtype=paddle.int32)
tmp_masks = tmp_masks.unsqueeze(1) #[B=1, C=1, T] tmp_masks = tmp_masks.unsqueeze(1) #[B=1, C=1, T]
......
# Copyright 2021 Mobvoi Inc. All Rights Reserved.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
......
...@@ -58,7 +58,7 @@ def pad_sequence(sequences: List[paddle.Tensor], ...@@ -58,7 +58,7 @@ def pad_sequence(sequences: List[paddle.Tensor],
>>> a = paddle.ones(25, 300) >>> a = paddle.ones(25, 300)
>>> b = paddle.ones(22, 300) >>> b = paddle.ones(22, 300)
>>> c = paddle.ones(15, 300) >>> c = paddle.ones(15, 300)
>>> pad_sequence([a, b, c]).size() >>> pad_sequence([a, b, c]).shape
paddle.Tensor([25, 3, 300]) paddle.Tensor([25, 3, 300])
Note: Note:
...@@ -79,10 +79,10 @@ def pad_sequence(sequences: List[paddle.Tensor], ...@@ -79,10 +79,10 @@ def pad_sequence(sequences: List[paddle.Tensor],
# assuming trailing dimensions and type of all the Tensors # assuming trailing dimensions and type of all the Tensors
# in sequences are same and fetching those from sequences[0] # in sequences are same and fetching those from sequences[0]
max_size = sequences[0].size() max_size = paddle.shape(sequences[0])
# (TODO Hui Zhang): slice does not support `end==start` # (TODO Hui Zhang): slice does not support `end==start`
# trailing_dims = max_size[1:] # trailing_dims = max_size[1:]
trailing_dims = max_size[1:] if max_size.ndim >= 2 else () trailing_dims = tuple(max_size[1:].numpy().tolist()) if sequences[0].ndim >= 2 else ()
max_len = max([s.shape[0] for s in sequences]) max_len = max([s.shape[0] for s in sequences])
if batch_first: if batch_first:
out_dims = (len(sequences), max_len) + trailing_dims out_dims = (len(sequences), max_len) + trailing_dims
...@@ -99,7 +99,7 @@ def pad_sequence(sequences: List[paddle.Tensor], ...@@ -99,7 +99,7 @@ def pad_sequence(sequences: List[paddle.Tensor],
if batch_first: if batch_first:
# TODO (Hui Zhang): set_value op does not support `end==start` # TODO (Hui Zhang): set_value op does not support `end==start`
# TODO (Hui Zhang): set_value op not support int16 # TODO (Hui Zhang): set_value op not support int16
# TODO (Hui Zhang): set_varbase 2 rank not support [0,0,...] # TODO (Hui Zhang): set_varbase 2 rank not support [0,0,...]
# out_tensor[i, :length, ...] = tensor # out_tensor[i, :length, ...] = tensor
if length != 0: if length != 0:
out_tensor[i, :length] = tensor out_tensor[i, :length] = tensor
...@@ -145,7 +145,7 @@ def add_sos_eos(ys_pad: paddle.Tensor, sos: int, eos: int, ...@@ -145,7 +145,7 @@ def add_sos_eos(ys_pad: paddle.Tensor, sos: int, eos: int,
[ 4, 5, 6, 11, -1, -1], [ 4, 5, 6, 11, -1, -1],
[ 7, 8, 9, 11, -1, -1]]) [ 7, 8, 9, 11, -1, -1]])
""" """
# TODO(Hui Zhang): using comment code, # TODO(Hui Zhang): using comment code,
#_sos = paddle.to_tensor( #_sos = paddle.to_tensor(
# [sos], dtype=paddle.long, stop_gradient=True, place=ys_pad.place) # [sos], dtype=paddle.long, stop_gradient=True, place=ys_pad.place)
#_eos = paddle.to_tensor( #_eos = paddle.to_tensor(
......
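The padding layout that `pad_sequence` produces can be illustrated without tensors. The sketch below handles only 1-D sequences as Python lists; `pad_sequence_lists` and its `padding_value` parameter are hypothetical names chosen for the example, not the function's confirmed signature.

```python
def pad_sequence_lists(sequences, batch_first=False, padding_value=0):
    """Pad variable-length lists to the longest one (1-D sequences only)."""
    max_len = max(len(s) for s in sequences)
    # Right-pad every sequence to max_len: result is laid out as (B, T).
    padded = [list(s) + [padding_value] * (max_len - len(s)) for s in sequences]
    if batch_first:
        return padded
    # Transpose to (T, B) when the time axis should come first.
    return [list(step) for step in zip(*padded)]


print(pad_sequence_lists([[1, 2, 3], [4, 5], [6]], batch_first=True))
```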
# Copyright 2021 Mobvoi Inc. All Rights Reserved.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
# #
# Licensed under the Apache License, Version 2.0 (the "License"); # Licensed under the Apache License, Version 2.0 (the "License");
......
...@@ -243,8 +243,7 @@ def main(): ...@@ -243,8 +243,7 @@ def main():
# parse args and config and redirect to train_sp # parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(description="Train a HiFiGAN model.") parser = argparse.ArgumentParser(description="Train a HiFiGAN model.")
parser.add_argument( parser.add_argument("--config", type=str, help="HiFiGAN config file.")
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
...@@ -233,7 +233,7 @@ def main(): ...@@ -233,7 +233,7 @@ def main():
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Train a Multi-Band MelGAN model.") description="Train a Multi-Band MelGAN model.")
parser.add_argument( parser.add_argument(
"--config", type=str, help="config file to overwrite default config.") "--config", type=str, help="Multi-Band MelGAN config file.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
...@@ -208,7 +208,7 @@ def main(): ...@@ -208,7 +208,7 @@ def main():
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="Train a ParallelWaveGAN model.") description="Train a ParallelWaveGAN model.")
parser.add_argument( parser.add_argument(
"--config", type=str, help="config file to overwrite default config.") "--config", type=str, help="ParallelWaveGAN config file.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
...@@ -224,8 +224,7 @@ def main(): ...@@ -224,8 +224,7 @@ def main():
# parse args and config and redirect to train_sp # parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(description="Train a Style MelGAN model.") parser = argparse.ArgumentParser(description="Train a Style MelGAN model.")
parser.add_argument( parser.add_argument("--config", type=str, help="Style MelGAN config file.")
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
...@@ -160,7 +160,7 @@ def main(): ...@@ -160,7 +160,7 @@ def main():
parser = argparse.ArgumentParser(description="Train a TransformerTTS " parser = argparse.ArgumentParser(description="Train a TransformerTTS "
"model with LJSpeech TTS dataset.") "model with LJSpeech TTS dataset.")
parser.add_argument( parser.add_argument(
"--config", type=str, help="config file to overwrite default config.") "--config", type=str, help="TransformerTTS config file.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
...@@ -226,9 +226,8 @@ def train_sp(args, config): ...@@ -226,9 +226,8 @@ def train_sp(args, config):
def main(): def main():
# parse args and config and redirect to train_sp # parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(description="Train a HiFiGAN model.") parser = argparse.ArgumentParser(description="Train a VITS model.")
parser.add_argument( parser.add_argument("--config", type=str, help="VITS config file")
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
...@@ -180,8 +180,7 @@ def main(): ...@@ -180,8 +180,7 @@ def main():
# parse args and config and redirect to train_sp # parse args and config and redirect to train_sp
parser = argparse.ArgumentParser(description="Train a WaveRNN model.") parser = argparse.ArgumentParser(description="Train a WaveRNN model.")
parser.add_argument( parser.add_argument("--config", type=str, help="WaveRNN config file.")
"--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.") parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.") parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.") parser.add_argument("--output-dir", type=str, help="output dir.")
......
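The training scripts touched above all declare the same four options and differ only in the model name used in the description and in the `--config` help text. A minimal sketch of that shared pattern follows; the `build_parser` helper and the example paths are hypothetical and are not part of the commit.

```python
import argparse


def build_parser(model_name):
    """Build a parser shaped like the ones in the training scripts above.

    `build_parser` is a hypothetical helper used only for illustration.
    """
    parser = argparse.ArgumentParser(
        description=f"Train a {model_name} model.")
    parser.add_argument(
        "--config", type=str, help=f"{model_name} config file.")
    parser.add_argument("--train-metadata", type=str, help="training data.")
    parser.add_argument("--dev-metadata", type=str, help="dev data.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    return parser


# The paths below are placeholders, not files from the repository.
args = build_parser("HiFiGAN").parse_args([
    "--config", "conf/example.yaml",
    "--train-metadata", "dump/train/metadata.jsonl",
    "--output-dir", "exp/example",
])

# argparse converts dashes in option names to underscores on the namespace.
print(args.config, args.train_metadata, args.dev_metadata, args.output_dir)
```

Options that are not passed on the command line, such as `--dev-metadata` here, default to `None`.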
...@@ -44,13 +44,13 @@ More details please see `README.md` under `examples`. ...@@ -44,13 +44,13 @@ More details please see `README.md` under `examples`.
> If using docker please check `--privileged` is set when `docker run`. > If using docker please check `--privileged` is set when `docker run`.
* Fatal error at startup: `a function redirection which is mandatory for this platform-tool combination cannot be set up` * Fatal error at startup: `a function redirection which is mandatory for this platform-tool combination cannot be set up`
``` ```bash
apt-get install libc6-dbg apt-get install libc6-dbg
``` ```
* Install * Install
``` ```bash
pushd tools pushd tools
./setup_valgrind.sh ./setup_valgrind.sh
popd popd
...@@ -59,4 +59,4 @@ popd ...@@ -59,4 +59,4 @@ popd
## TODO ## TODO
### Deepspeech2 with linear feature ### Deepspeech2 with linear feature
* DecibelNormalizer: there is a little bit difference between offline and online db norm. The computation of online db norm read feature chunk by chunk, which causes the feature size is different with offline db norm. In normalizer.cc:73, the samples.size() is different, which causes the difference of result. * DecibelNormalizer: there is a small difference between the offline and online db norm. The computation of online db norm reads features chunk by chunk, which causes the feature size to be different different with offline db norm. In `normalizer.cc:73`, the `samples.size()` is different, which causes the different result.
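The offline/online discrepancy described in this TODO item can be illustrated with a small sketch. It assumes the normalizer scales samples to a target RMS level in dB, which may not match what `normalizer.cc` actually computes; the function names and the `target_db` default are hypothetical.

```python
import math


def db_normalize(samples, target_db=-20.0):
    """Scale samples so that their RMS level equals target_db (assumed formula)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    rms_db = 10.0 * math.log10(mean_square)
    gain = 10.0 ** ((target_db - rms_db) / 20.0)
    return [s * gain for s in samples]


def db_normalize_chunked(samples, chunk_size, target_db=-20.0):
    """Normalize each chunk on its own, as a streaming front end would."""
    out = []
    for start in range(0, len(samples), chunk_size):
        out.extend(db_normalize(samples[start:start + chunk_size], target_db))
    return out


# A quiet first half followed by a loud second half.
signal = [0.01, -0.01, 0.01, -0.01, 0.5, -0.5, 0.5, -0.5]

offline = db_normalize(signal)                       # one gain for the whole signal
online = db_normalize_chunked(signal, chunk_size=4)  # one gain per chunk

max_diff = max(abs(a - b) for a, b in zip(offline, online))
print(max_diff)
```

Because each chunk gets its own gain, the quiet chunk is amplified far more than it is in the offline pass, so the two outputs differ even though the input samples are identical.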
...@@ -25,7 +25,7 @@ paddlespeech asr --model deepspeech2offline_librispeech --lang en --input ./en.w ...@@ -25,7 +25,7 @@ paddlespeech asr --model deepspeech2offline_librispeech --lang en --input ./en.w
# long audio restriction # long audio restriction
{ {
wget -c https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav wget -c https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav
paddlespeech asr --input test_long_audio_01.wav paddlespeech asr --model deepspeech2online_wenetspeech --input test_long_audio_01.wav -y
if [ $? -ne 255 ]; then if [ $? -ne 255 ]; then
echo -e "\e[1;31mTime restriction not passed\e[0m" echo -e "\e[1;31mTime restriction not passed\e[0m"
exit 1 exit 1
...@@ -54,7 +54,7 @@ paddlespeech tts --am tacotron2_ljspeech --voc pwgan_ljspeech --lang en --input ...@@ -54,7 +54,7 @@ paddlespeech tts --am tacotron2_ljspeech --voc pwgan_ljspeech --lang en --input
# Speech Translation (only support linux) # Speech Translation (only support linux)
paddlespeech st --input ./en.wav paddlespeech st --input ./en.wav
# Speaker Verification # Speaker Verification
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input 85236145389.wav paddlespeech vector --task spk --input 85236145389.wav
...@@ -65,7 +65,7 @@ echo -e "demo1 85236145389.wav \n demo2 85236145389.wav" > vec.job ...@@ -65,7 +65,7 @@ echo -e "demo1 85236145389.wav \n demo2 85236145389.wav" > vec.job
paddlespeech vector --task spk --input vec.job paddlespeech vector --task spk --input vec.job
echo -e "demo3 85236145389.wav \n demo4 85236145389.wav" | paddlespeech vector --task spk echo -e "demo3 85236145389.wav \n demo4 85236145389.wav" | paddlespeech vector --task spk
rm 85236145389.wav rm 85236145389.wav
rm vec.job rm vec.job
# shell pipeline # shell pipeline
......
* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features) * [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926 ref: https://zhuanlan.zhihu.com/p/55371926
licence: MIT license: MIT
* [python-pinyin](https://github.com/mozillazg/python-pinyin.git) * [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03 commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
licence: MIT license: MIT
* [zhon](https://github.com/tsroten/zhon) * [zhon](https://github.com/tsroten/zhon)
commit: 09bf543696277f71de502506984661a60d24494c commit: 09bf543696277f71de502506984661a60d24494c
licence: MIT license: MIT
* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git) * [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
licence: MIT license: MIT
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git) * [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
licence: MIT license: MIT
* [phkit](https://github.com/KuangDD/phkit.git) * [phkit](https://github.com/KuangDD/phkit.git)
commit: b2100293c1e36da531d7f30bd52c9b955a649522 commit: b2100293c1e36da531d7f30bd52c9b955a649522
licence: None license: None
* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git) * [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git)
licence: MIT license: MIT
...@@ -5,4 +5,4 @@ score.h and score.cpp is under the LGPL license. ...@@ -5,4 +5,4 @@ score.h and score.cpp is under the LGPL license.
The two files include the header files from KenLM project. The two files include the header files from KenLM project.
For the rest: For the rest:
The default licence of paddlespeech-ctcdecoders is Apache License 2.0. The default license of paddlespeech-ctcdecoders is Apache License 2.0.
#!/usr/bin/env python3 #!/usr/bin/env python3
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# CopyRight WeNet Apache-2.0 License # Copyright 2021 Mobvoi Inc. All Rights Reserved.
import codecs import codecs
import re import re
import sys import sys
......