PaddlePaddle / DeepSpeech

Unverified commit 175c39b4, authored Mar 02, 2022 by 小湉湉, committed via GitHub on Mar 02, 2022
Merge pull request #1511 from yt605155624/pre_fix_for_streaming
[TTS]add rtf for synthesize, add more vocoder for synthesize.sh
Parents: 4ad325bc, 641984ae
Showing 11 changed files with 443 additions and 150 deletions (+443 −150)
examples/csmsc/tts0/local/synthesize.sh        +94  −14
examples/csmsc/tts0/local/synthesize_e2e.sh     +9   −8
examples/csmsc/tts2/local/synthesize.sh       +100  −15
examples/csmsc/tts2/local/synthesize_e2e.sh     +7   −7
examples/csmsc/tts3/local/synthesize.sh        +94  −14
examples/csmsc/tts3/local/synthesize_e2e.sh     +7   −6
paddlespeech/t2s/exps/synthesize.py            +63  −31
paddlespeech/t2s/exps/synthesize_e2e.py        +58  −48
paddlespeech/t2s/exps/wavernn/synthesize.py     +1   −1
paddlespeech/t2s/models/melgan/melgan.py        +1   −1
paddlespeech/t2s/models/wavernn/wavernn.py      +9   −5
examples/csmsc/tts0/local/synthesize.sh
@@ -3,18 +3,98 @@
 config_path=$1
 train_output_path=$2
 ckpt_name=$3

 stage=0
 stop_stage=0

-FLAGS_allocator_strategy=naive_best_fit \
-FLAGS_fraction_of_gpu_memory_to_use=0.01 \
-python3 ${BIN_DIR}/../synthesize.py \
-    --am=tacotron2_csmsc \
-    --am_config=${config_path} \
-    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
-    --am_stat=dump/train/speech_stats.npy \
-    --voc=pwgan_csmsc \
-    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
-    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
-    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
-    --test_metadata=dump/test/norm/metadata.jsonl \
-    --output_dir=${train_output_path}/test \
-    --phones_dict=dump/phone_id_map.txt
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=tacotron2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=pwgan_csmsc \
+        --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
+        --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
+        --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# for more GAN Vocoders
+# multi band melgan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=tacotron2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=mb_melgan_csmsc \
+        --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
+        --voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# style melgan
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=tacotron2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=style_melgan_csmsc \
+        --voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
+        --voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# hifigan
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    echo "in hifigan syn"
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=tacotron2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=hifigan_csmsc \
+        --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
+        --voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# wavernn
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    echo "in wavernn syn"
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=tacotron2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=wavernn_csmsc \
+        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
+        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
+        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
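These stages follow the Kaldi-style `stage`/`stop_stage` convention: block N runs only when `stage <= N <= stop_stage`, so a caller can pick out a single vocoder by pointing both variables at the same stage number. A minimal, hypothetical Python sketch of that gating logic (the stage-to-vocoder mapping simply mirrors the comments in the script above; none of these names come from the commit itself):

```python
# Hypothetical illustration of the Kaldi-style stage gating in synthesize.sh.
VOCODERS = {
    0: "pwgan_csmsc",
    1: "mb_melgan_csmsc",
    2: "style_melgan_csmsc",
    3: "hifigan_csmsc",
    4: "wavernn_csmsc",
}

def selected_stages(stage: int, stop_stage: int) -> list:
    """Block n runs iff stage <= n <= stop_stage."""
    return [n for n in sorted(VOCODERS) if stage <= n <= stop_stage]

# The script defaults (stage=0, stop_stage=0) run only pwgan;
# stage=3, stop_stage=4 would run hifigan and then wavernn.
print([VOCODERS[n] for n in selected_stages(3, 4)])
```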
examples/csmsc/tts0/local/synthesize_e2e.sh
@@ -8,6 +8,7 @@ stage=0
 stop_stage=0

+# TODO: audio from the dygraph-to-static export of tacotron2 is not as loud as the static-graph result; some function in decode probably still mismatches between dynamic and static graphs
 # pwgan
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     FLAGS_allocator_strategy=naive_best_fit \
     FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@@ -39,14 +40,14 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
         --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
         --am_stat=dump/train/speech_stats.npy \
         --voc=mb_melgan_csmsc \
-        --voc_config=mb_melgan_baker_finetune_ckpt_0.5/finetune.yaml \
-        --voc_ckpt=mb_melgan_baker_finetune_ckpt_0.5/snapshot_iter_2000000.pdz \
-        --voc_stat=mb_melgan_baker_finetune_ckpt_0.5/feats_stats.npy \
+        --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
+        --voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
-        --phones_dict=dump/phone_id_map.txt
+        --phones_dict=dump/phone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # the pretrained models haven't release now
@@ -88,8 +89,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
-        --phones_dict=dump/phone_id_map.txt
+        --phones_dict=dump/phone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # wavernn
@@ -111,4 +112,4 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
         --output_dir=${train_output_path}/test_e2e \
         --phones_dict=dump/phone_id_map.txt \
         --inference_dir=${train_output_path}/inference
-fi
\ No newline at end of file
+fi
examples/csmsc/tts2/local/synthesize.sh
 #!/bin/bash

 config_path=$1
 train_output_path=$2
 ckpt_name=$3

 stage=0
 stop_stage=0

+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=speedyspeech_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=pwgan_csmsc \
+        --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
+        --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
+        --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
+
+# for more GAN Vocoders
+# multi band melgan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=speedyspeech_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=mb_melgan_csmsc \
+        --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
+        --voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
+
+# style melgan
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=speedyspeech_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=style_melgan_csmsc \
+        --voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
+        --voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
+
+# hifigan
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    echo "in hifigan syn"
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=speedyspeech_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=hifigan_csmsc \
+        --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
+        --voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
-FLAGS_allocator_strategy=naive_best_fit \
-FLAGS_fraction_of_gpu_memory_to_use=0.01 \
-python3 ${BIN_DIR}/../synthesize.py \
-    --am=speedyspeech_csmsc \
-    --am_config=${config_path} \
-    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
-    --am_stat=dump/train/feats_stats.npy \
-    --voc=pwgan_csmsc \
-    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
-    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
-    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
-    --test_metadata=dump/test/norm/metadata.jsonl \
-    --output_dir=${train_output_path}/test \
-    --phones_dict=dump/phone_id_map.txt \
-    --tones_dict=dump/tone_id_map.txt
\ No newline at end of file
+
+# wavernn
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    echo "in wavernn syn"
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=speedyspeech_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=wavernn_csmsc \
+        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
+        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
+        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --tones_dict=dump/tone_id_map.txt \
+        --phones_dict=dump/phone_id_map.txt
+fi
examples/csmsc/tts2/local/synthesize_e2e.sh
@@ -7,6 +7,7 @@ ckpt_name=$3
 stage=0
 stop_stage=0

+# pwgan
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     FLAGS_allocator_strategy=naive_best_fit \
     FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@@ -22,9 +23,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
         --phones_dict=dump/phone_id_map.txt \
-        --tones_dict=dump/tone_id_map.txt
+        --tones_dict=dump/tone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # for more GAN Vocoders
@@ -44,9 +45,9 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
         --phones_dict=dump/phone_id_map.txt \
-        --tones_dict=dump/tone_id_map.txt
+        --tones_dict=dump/tone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # the pretrained models haven't release now
@@ -88,12 +89,11 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
         --phones_dict=dump/phone_id_map.txt \
-        --tones_dict=dump/tone_id_map.txt
+        --tones_dict=dump/tone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # wavernn
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
     echo "in wavernn syn_e2e"
examples/csmsc/tts3/local/synthesize.sh
@@ -3,18 +3,98 @@
 config_path=$1
 train_output_path=$2
 ckpt_name=$3

 stage=0
 stop_stage=0

-FLAGS_allocator_strategy=naive_best_fit \
-FLAGS_fraction_of_gpu_memory_to_use=0.01 \
-python3 ${BIN_DIR}/../synthesize.py \
-    --am=fastspeech2_csmsc \
-    --am_config=${config_path} \
-    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
-    --am_stat=dump/train/speech_stats.npy \
-    --voc=pwgan_csmsc \
-    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
-    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
-    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
-    --test_metadata=dump/test/norm/metadata.jsonl \
-    --output_dir=${train_output_path}/test \
-    --phones_dict=dump/phone_id_map.txt
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=fastspeech2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=pwgan_csmsc \
+        --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
+        --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
+        --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# for more GAN Vocoders
+# multi band melgan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=fastspeech2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=mb_melgan_csmsc \
+        --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
+        --voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# style melgan
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=fastspeech2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=style_melgan_csmsc \
+        --voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
+        --voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# hifigan
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    echo "in hifigan syn"
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=fastspeech2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=hifigan_csmsc \
+        --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
+        --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
+        --voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# wavernn
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    echo "in wavernn syn"
+    FLAGS_allocator_strategy=naive_best_fit \
+    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+    python3 ${BIN_DIR}/../synthesize.py \
+        --am=fastspeech2_csmsc \
+        --am_config=${config_path} \
+        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=wavernn_csmsc \
+        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
+        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
+        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/test \
+        --phones_dict=dump/phone_id_map.txt
+fi
examples/csmsc/tts3/local/synthesize_e2e.sh
@@ -7,6 +7,7 @@ ckpt_name=$3
 stage=0
 stop_stage=0

+# pwgan
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     FLAGS_allocator_strategy=naive_best_fit \
     FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@@ -22,8 +23,8 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
-        --phones_dict=dump/phone_id_map.txt
+        --phones_dict=dump/phone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # for more GAN Vocoders
@@ -43,8 +44,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
-        --phones_dict=dump/phone_id_map.txt
+        --phones_dict=dump/phone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi

 # the pretrained models haven't release now
@@ -86,8 +87,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
         --lang=zh \
         --text=${BIN_DIR}/../sentences.txt \
         --output_dir=${train_output_path}/test_e2e \
-        --inference_dir=${train_output_path}/inference \
-        --phones_dict=dump/phone_id_map.txt
+        --phones_dict=dump/phone_id_map.txt \
+        --inference_dir=${train_output_path}/inference
 fi
paddlespeech/t2s/exps/synthesize.py
@@ -20,6 +20,7 @@ import numpy as np
 import paddle
 import soundfile as sf
 import yaml
+from timer import timer
 from yacs.config import CfgNode

 from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@@ -50,6 +51,18 @@ model_alias = {
     "paddlespeech.t2s.models.melgan:MelGANGenerator",
     "mb_melgan_inference":
     "paddlespeech.t2s.models.melgan:MelGANInference",
+    "style_melgan":
+    "paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
+    "style_melgan_inference":
+    "paddlespeech.t2s.models.melgan:StyleMelGANInference",
+    "hifigan":
+    "paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
+    "hifigan_inference":
+    "paddlespeech.t2s.models.hifigan:HiFiGANInference",
+    "wavernn":
+    "paddlespeech.t2s.models.wavernn:WaveRNN",
+    "wavernn_inference":
+    "paddlespeech.t2s.models.wavernn:WaveRNNInference",
 }
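Each alias maps a `--voc`/`--am` name to a `"module.path:ClassName"` string that `dynamic_import` resolves at runtime. A rough re-implementation sketch of that lookup, for orientation only (behavior inferred from the table above; the real helper lives in `paddlespeech.s2t.utils.dynamic_import` and may differ in details such as error handling):

```python
import importlib

def dynamic_import_sketch(name: str, alias: dict):
    # Resolve e.g. "hifigan" -> "paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
    # then import the module and return the class object.
    path = alias.get(name, name)
    module_name, class_name = path.split(":")
    return getattr(importlib.import_module(module_name), class_name)
```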
@@ -146,10 +159,15 @@ def evaluate(args):
     voc_name = args.voc[:args.voc.rindex('_')]
     voc_class = dynamic_import(voc_name, model_alias)
     voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
-    voc = voc_class(**voc_config["generator_params"])
-    voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
-    voc.remove_weight_norm()
-    voc.eval()
+    if voc_name != 'wavernn':
+        voc = voc_class(**voc_config["generator_params"])
+        voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
+        voc.remove_weight_norm()
+        voc.eval()
+    else:
+        voc = voc_class(**voc_config["model"])
+        voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
+        voc.eval()
     voc_mu, voc_std = np.load(args.voc_stat)
     voc_mu = paddle.to_tensor(voc_mu)
     voc_std = paddle.to_tensor(voc_std)
@@ -162,38 +180,51 @@ def evaluate(args):
     output_dir = Path(args.output_dir)
     output_dir.mkdir(parents=True, exist_ok=True)
+    N = 0
+    T = 0
     for datum in test_dataset:
         utt_id = datum["utt_id"]
-        with paddle.no_grad():
-            # acoustic model
-            if am_name == 'fastspeech2':
-                phone_ids = paddle.to_tensor(datum["text"])
-                spk_emb = None
-                spk_id = None
-                # multi speaker
-                if args.voice_cloning and "spk_emb" in datum:
-                    spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
-                elif "spk_id" in datum:
-                    spk_id = paddle.to_tensor(datum["spk_id"])
-                mel = am_inference(phone_ids, spk_id=spk_id, spk_emb=spk_emb)
-            elif am_name == 'speedyspeech':
-                phone_ids = paddle.to_tensor(datum["phones"])
-                tone_ids = paddle.to_tensor(datum["tones"])
-                mel = am_inference(phone_ids, tone_ids)
-            elif am_name == 'tacotron2':
-                phone_ids = paddle.to_tensor(datum["text"])
-                spk_emb = None
-                # multi speaker
-                if args.voice_cloning and "spk_emb" in datum:
-                    spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
-                mel = am_inference(phone_ids, spk_emb=spk_emb)
-            # vocoder
-            wav = voc_inference(mel)
+        with timer() as t:
+            with paddle.no_grad():
+                # acoustic model
+                if am_name == 'fastspeech2':
+                    phone_ids = paddle.to_tensor(datum["text"])
+                    spk_emb = None
+                    spk_id = None
+                    # multi speaker
+                    if args.voice_cloning and "spk_emb" in datum:
+                        spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
+                    elif "spk_id" in datum:
+                        spk_id = paddle.to_tensor(datum["spk_id"])
+                    mel = am_inference(phone_ids, spk_id=spk_id, spk_emb=spk_emb)
+                elif am_name == 'speedyspeech':
+                    phone_ids = paddle.to_tensor(datum["phones"])
+                    tone_ids = paddle.to_tensor(datum["tones"])
+                    mel = am_inference(phone_ids, tone_ids)
+                elif am_name == 'tacotron2':
+                    phone_ids = paddle.to_tensor(datum["text"])
+                    spk_emb = None
+                    # multi speaker
+                    if args.voice_cloning and "spk_emb" in datum:
+                        spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
+                    mel = am_inference(phone_ids, spk_emb=spk_emb)
+                # vocoder
+                wav = voc_inference(mel)
+        wav = wav.numpy()
+        N += wav.size
+        T += t.elapse
+        speed = wav.size / t.elapse
+        rtf = am_config.fs / speed
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.size}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}.")
         sf.write(
-            str(output_dir / (utt_id + ".wav")),
-            wav.numpy(),
-            samplerate=am_config.fs)
+            str(output_dir / (utt_id + ".wav")), wav, samplerate=am_config.fs)
         print(f"{utt_id} done!")
+    print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")


 def main():
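This hunk is the "add rtf" half of the PR title. RTF (real-time factor) is synthesis wall-clock time divided by the duration of the audio produced; the script derives it as `fs / speed`, where `speed` is generated samples per wall-clock second, so values below 1.0 mean faster than real time. A self-contained sketch of the same bookkeeping, using the `timer` context manager (and its `.elapse` attribute) that the script itself imports; the loop body is a stand-in for the real model calls:

```python
from time import sleep
from timer import timer  # same helper the commit imports

fs = 24000   # sample rate, am_config.fs (the CSMSC models here run at 24 kHz)
N, T = 0, 0.0
for _ in range(2):            # stand-in for the test_dataset loop
    with timer() as t:
        sleep(0.01)           # stand-in for am_inference + voc_inference
        wav_size = fs         # pretend exactly 1 s of audio was generated
    N += wav_size
    T += t.elapse
    speed = wav_size / t.elapse   # samples per wall-clock second
    rtf = fs / speed              # < 1.0 means faster than real time
    print(f"Hz: {speed:.0f}, RTF: {rtf:.4f}")
print(f"generation speed: {N / T:.0f}Hz, RTF: {fs / (N / T):.4f}")
```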
@@ -246,7 +277,8 @@ def main():
         default='pwgan_csmsc',
         choices=[
             'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
-            'mb_melgan_csmsc'
+            'mb_melgan_csmsc', 'wavernn_csmsc', 'hifigan_csmsc',
+            'style_melgan_csmsc'
         ],
         help='Choose vocoder type of tts task.')
paddlespeech/t2s/exps/synthesize_e2e.py
@@ -21,6 +21,7 @@ import soundfile as sf
 import yaml
 from paddle import jit
 from paddle.static import InputSpec
+from timer import timer
 from yacs.config import CfgNode

 from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@@ -233,59 +234,68 @@ def evaluate(args):
     # but still not stopping in the end (NOTE by yuantian01 Feb 9 2022)
     if am_name == 'tacotron2':
         merge_sentences = True

+    N = 0
+    T = 0
     for utt_id, sentence in sentences:
-        get_tone_ids = False
-        if am_name == 'speedyspeech':
-            get_tone_ids = True
-        if args.lang == 'zh':
-            input_ids = frontend.get_input_ids(
-                sentence,
-                merge_sentences=merge_sentences,
-                get_tone_ids=get_tone_ids)
-            phone_ids = input_ids["phone_ids"]
-            if get_tone_ids:
-                tone_ids = input_ids["tone_ids"]
-        elif args.lang == 'en':
-            input_ids = frontend.get_input_ids(
-                sentence, merge_sentences=merge_sentences)
-            phone_ids = input_ids["phone_ids"]
-        else:
-            print("lang should in {'zh', 'en'}!")
-        with paddle.no_grad():
-            flags = 0
-            for i in range(len(phone_ids)):
-                part_phone_ids = phone_ids[i]
-                # acoustic model
-                if am_name == 'fastspeech2':
-                    # multi speaker
-                    if am_dataset in {"aishell3", "vctk"}:
-                        spk_id = paddle.to_tensor(args.spk_id)
-                        mel = am_inference(part_phone_ids, spk_id)
-                    else:
-                        mel = am_inference(part_phone_ids)
-                elif am_name == 'speedyspeech':
-                    part_tone_ids = tone_ids[i]
-                    if am_dataset in {"aishell3", "vctk"}:
-                        spk_id = paddle.to_tensor(args.spk_id)
-                        mel = am_inference(part_phone_ids, part_tone_ids,
-                                           spk_id)
-                    else:
-                        mel = am_inference(part_phone_ids, part_tone_ids)
-                elif am_name == 'tacotron2':
-                    mel = am_inference(part_phone_ids)
-                # vocoder
-                wav = voc_inference(mel)
-                if flags == 0:
-                    wav_all = wav
-                    flags = 1
-                else:
-                    wav_all = paddle.concat([wav_all, wav])
+        with timer() as t:
+            get_tone_ids = False
+            if am_name == 'speedyspeech':
+                get_tone_ids = True
+            if args.lang == 'zh':
+                input_ids = frontend.get_input_ids(
+                    sentence,
+                    merge_sentences=merge_sentences,
+                    get_tone_ids=get_tone_ids)
+                phone_ids = input_ids["phone_ids"]
+                if get_tone_ids:
+                    tone_ids = input_ids["tone_ids"]
+            elif args.lang == 'en':
+                input_ids = frontend.get_input_ids(
+                    sentence, merge_sentences=merge_sentences)
+                phone_ids = input_ids["phone_ids"]
+            else:
+                print("lang should in {'zh', 'en'}!")
+            with paddle.no_grad():
+                flags = 0
+                for i in range(len(phone_ids)):
+                    part_phone_ids = phone_ids[i]
+                    # acoustic model
+                    if am_name == 'fastspeech2':
+                        # multi speaker
+                        if am_dataset in {"aishell3", "vctk"}:
+                            spk_id = paddle.to_tensor(args.spk_id)
+                            mel = am_inference(part_phone_ids, spk_id)
+                        else:
+                            mel = am_inference(part_phone_ids)
+                    elif am_name == 'speedyspeech':
+                        part_tone_ids = tone_ids[i]
+                        if am_dataset in {"aishell3", "vctk"}:
+                            spk_id = paddle.to_tensor(args.spk_id)
+                            mel = am_inference(part_phone_ids, part_tone_ids,
+                                               spk_id)
+                        else:
+                            mel = am_inference(part_phone_ids, part_tone_ids)
+                    elif am_name == 'tacotron2':
+                        mel = am_inference(part_phone_ids)
+                    # vocoder
+                    wav = voc_inference(mel)
+                    if flags == 0:
+                        wav_all = wav
+                        flags = 1
+                    else:
+                        wav_all = paddle.concat([wav_all, wav])
+        wav = wav_all.numpy()
+        N += wav.size
+        T += t.elapse
+        speed = wav.size / t.elapse
+        rtf = am_config.fs / speed
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}.")
         sf.write(
-            str(output_dir / (utt_id + ".wav")),
-            wav_all.numpy(),
-            samplerate=am_config.fs)
+            str(output_dir / (utt_id + ".wav")), wav, samplerate=am_config.fs)
         print(f"{utt_id} done!")
+    print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")


 def main():
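Two details differ from the non-e2e script: the timer starts before the text frontend, so e2e RTF also charges text normalization and G2P time, and each utterance's waveform is stitched from per-sentence chunks via `paddle.concat`. A small NumPy stand-in for that accumulate-and-concatenate pattern (illustration only, not from the commit):

```python
import numpy as np

def join_parts(parts):
    # Mirrors the flags/wav_all accumulation above: the first chunk seeds
    # wav_all, every later chunk is concatenated onto it.
    wav_all = None
    for wav in parts:
        wav_all = wav if wav_all is None else np.concatenate([wav_all, wav])
    return wav_all

print(join_parts([np.ones(2), np.zeros(3)]))  # -> [1. 1. 0. 0. 0.]
```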
paddlespeech/t2s/exps/wavernn/synthesize.py
@@ -91,7 +91,7 @@ def main():
         target=config.inference.target,
         overlap=config.inference.overlap,
         mu_law=config.mu_law,
-        gen_display=True)
+        gen_display=False)
     wav = wav.numpy()
     N += wav.size
     T += t.elapse
paddlespeech/t2s/models/melgan/melgan.py
@@ -66,7 +66,7 @@ class MelGANGenerator(nn.Layer):
         nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network,
             by default {}
         pad (str): Padding function module name before dilated convolution layer.
-        pad_params (dict): Hyperparameters for padding function.
+        pad_params (dict): Hyperparameters for padding function.
 fi
         use_final_nonlinear_activation (nn.Layer): Activation function for the final layer.
         use_weight_norm (bool): Whether to use weight norm.
             If set to true, it will be applied to all of the conv layers.

(The changed docstring line differs only in whitespace.)
paddlespeech/t2s/models/wavernn/wavernn.py
@@ -509,16 +509,20 @@ class WaveRNN(nn.Layer):
         total_len = num_folds * (target + overlap) + overlap

         # Need some silence for the run warmup
-        slience_len = overlap // 2
+        slience_len = 0
+        linear_len = slience_len
         fade_len = overlap - slience_len
         slience = paddle.zeros([slience_len], dtype=paddle.float32)
-        linear = paddle.ones([fade_len], dtype=paddle.float32)
+        linear = paddle.ones([linear_len], dtype=paddle.float32)

         # Equal power crossfade
-        t = paddle.linspace(-1, 1, fade_len, dtype=paddle.float32)
-        fade_in = paddle.sqrt(0.5 * (1 + t))
-        fade_out = paddle.sqrt(0.5 * (1 - t))
+        # fade_in increases from 0 to 1, fade_out decreases from 1 to 0
+        sigmoid_scale = 2.3
+        t = paddle.linspace(
+            -sigmoid_scale, sigmoid_scale, fade_len, dtype=paddle.float32)
+        # a sigmoid curve should be better here
+        fade_in = paddle.nn.functional.sigmoid(t)
+        fade_out = 1 - paddle.nn.functional.sigmoid(t)

         # Concat the silence to the fades
         fade_out = paddle.concat([linear, fade_out])
         fade_in = paddle.concat([slience, fade_in])
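This hunk changes how overlapped WaveRNN batch folds are cross-faded back together: the warm-up silence is dropped (`slience_len = 0`) and the equal-power square-root fade becomes a sigmoid fade. The trade-off: the sqrt pair keeps summed power constant (`fade_in**2 + fade_out**2 == 1`), while the sigmoid pair keeps summed amplitude constant (`fade_in + fade_out == 1`). A NumPy comparison sketch (illustration only, not from the commit):

```python
import numpy as np

fade_len = 8

# Old: equal-power crossfade; power is constant across the overlap.
t = np.linspace(-1, 1, fade_len, dtype=np.float32)
fade_in_old = np.sqrt(0.5 * (1 + t))
fade_out_old = np.sqrt(0.5 * (1 - t))
print(fade_in_old**2 + fade_out_old**2)   # all (approximately) ones

# New: sigmoid crossfade; amplitude is constant across the overlap.
sigmoid_scale = 2.3
t = np.linspace(-sigmoid_scale, sigmoid_scale, fade_len, dtype=np.float32)
sig = 1.0 / (1.0 + np.exp(-t))
print(sig + (1.0 - sig))                  # exactly all ones
```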