PaddlePaddle / DeepSpeech
Commit 40ab05a4 (unverified)
Authored by Hui Zhang on Mar 10, 2022; committed via GitHub on Mar 10, 2022

Merge pull request #1552 from yt605155624/format_syn

[TTS]format synthesize

Parents: c8ecb941, 544c372b
Showing 7 changed files with 551 additions and 468 deletions (+551, −468):

paddlespeech/t2s/exps/csmsc_test.txt                        +100  −0
paddlespeech/t2s/exps/inference.py                          +155  −92
paddlespeech/t2s/exps/syn_utils.py                          +243  −0
paddlespeech/t2s/exps/synthesize.py                         +12   −130
paddlespeech/t2s/exps/synthesize_e2e.py                     +19   −181
paddlespeech/t2s/exps/voice_cloning.py                      +12   −65
paddlespeech/t2s/modules/predictor/length_regulator.py      +10   −0
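The thread running through all seven files: every acoustic model and vocoder is named by a CLI identifier of the form {model_name}_{dataset} (the diff's own comment "# model: {model_name}_{dataset}"), and the loading code keyed on that convention is hoisted into the new syn_utils.py. A minimal sketch of the split used throughout the diff; the value below is illustrative:

# "{model_name}_{dataset}" identifiers are split at the last underscore.
am = "fastspeech2_csmsc"               # illustrative value of args.am
am_name = am[:am.rindex('_')]          # -> "fastspeech2"
am_dataset = am[am.rindex('_') + 1:]   # -> "csmsc"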
paddlespeech/t2s/exps/csmsc_test.txt (new file, 0 → 100644)
009901 昨日,这名伤者与医生全部被警方依法刑事拘留。
009902 钱伟长想到上海来办学校是经过深思熟虑的。
009903 她见我一进门就骂,吃饭时也骂,骂得我抬不起头。
009904 李述德在离开之前,只说了一句柱驼杀父亲了。
009905 这种车票和保险单捆绑出售属于重复性购买。
009906 戴佩妮的男友西米露接唱情歌,让她非常开心。
009907 观大势,谋大局,出大策始终是该院的办院方针。
009908 他们骑着摩托回家,正好为农忙时的父母帮忙。
009909 但是因为还没到退休年龄,只能掰着指头捱日子。
009910 这几天雨水不断,人们恨不得待在家里不出门。
009911 没想到徐赟,张海翔两人就此玩起了人间蒸发。
009912 藤村此番发言可能是为了凸显野田的领导能力。
009913 程长庚,生在清王朝嘉庆年间,安徽的潜山小县。
009914 南海海域综合补给基地码头项目正在论证中。
009915 也就是说今晚成都市民极有可能再次看到飘雪。
009916 随着天气转热,各地的游泳场所开始人头攒动。
009917 更让徐先生纳闷的是,房客的手机也打不通了。
009918 遇到颠簸时,应听从乘务员的安全指令,回座位坐好。
009919 他在后面呆惯了,怕自己一插身后的人会不满,不敢排进去。
009920 傍晚七个小人回来了,白雪公主说,你们就是我命中的七个小矮人吧。
009921 他本想说,教育局管这个,他们是一路的,这样一管岂不是妓女起嫖客?
009922 一种表示商品所有权的财物证券,也称商品证券,如提货单,交货单。
009923 会有很丰富的东西留下来,说都说不完。
009924 这句话像从天而降,吓得四周一片寂静。
009925 记者所在的是受害人家属所在的右区。
009926 不管哈大爷去哪,它都一步不离地跟着。
009927 大家抬头望去,一只老鼠正趴在吊顶上。
009928 我决定过年就辞职,接手我爸的废品站!
009929 最终,中国男子乒乓球队获得此奖项。
009930 防汛抗旱两手抓,抗旱相对抓的不够。
009931 图们江下游地区开发开放的进展如何?
009932 这要求中国必须有一个坚强的政党领导。
009933 再说,关于利益上的事俺俩都不好开口。
009934 明代瓦剌,鞑靼入侵明境也是通过此地。
009935 咪咪舔着孩子,把它身上的毛舔干净。
009936 是否这次的国标修订被大企业绑架了?
009937 判决后,姚某妻子胡某不服,提起上诉。
009938 由此可以看出邯钢的经济效益来自何处。
009939 琳达说,是瑜伽改变了她和马儿的生活。
009940 楼下的保安告诉记者,这里不租也不卖。
009941 习近平说,中斯两国人民传统友谊深厚。
009942 传闻越来越多,后来连老汉儿自己都怕了。
009943 我怒吼一声冲上去,举起砖头砸了过去。
009944 我现在还不会,这就回去问问发明我的人。
009945 显然,洛阳性奴案不具备上述两个前提。
009946 另外,杰克逊有文唇线,眼线,眉毛的动作。
009947 昨晚,华西都市报记者电话采访了尹琪。
009948 涅拉季科未透露这些航空公司的名称。
009949 从运行轨迹上来说,它也不可能是星星。
009950 目前看,如果继续加息也存在两难问题。
009951 曾宝仪在节目录制现场大爆观众糗事。
009952 但任凭周某怎么叫,男子仍酣睡不醒。
009953 老大爷说,小子,你挡我财路了,知道不?
009954 没料到,闯下大头佛的阿伟还不知悔改。
009955 卡扎菲部落式统治已遭遇部落内讧。
009956 这个孩子的生命一半来源于另一位女士捐赠的冷冻卵子。
009957 出现这种泥鳅内阁的局面既是野田有意为之,也实属无奈。
009958 济青高速济南,华山,章丘,邹平,周村,淄博,临淄站。
009959 赵凌飞的话,反映了沈阳赛区所有奥运志愿者的共同心声。
009960 因为,我们所发出的力量必会因难度加大而减弱。
009961 发生事故的楼梯拐角处仍可看到血迹。
009962 想过进公安,可能身高不够,老汉儿也不让我进去。
009963 路上关卡很多,为了方便撤离,只好轻装前进。
009964 原来比尔盖茨就是美国微软公司联合创始人呀。
009965 之后他们一家三口将与双方父母往峇里岛旅游。
009966 谢谢总理,也感谢广大网友的参与,我们明年再见。
009967 事实上是,从来没有一个欺善怕恶的人能作出过稍大一点的成就。
009968 我会打开邮件,你可以从那里继续。
009969 美方对近期东海局势表示关切。
009970 据悉,奥巴马一家人对这座冬季白宫极为满意。
009971 打扫完你会很有成就感的,试一试,你就信了。
009972 诺曼站在滑板车上,各就各位,准备出发啦!
009973 塔河的寒夜,气温降到了零下三十多摄氏度。
009974 其间,连破六点六,六点五,六点四,六点三五等多个重要关口。
009975 算命其实只是人们的一种自我安慰和自我暗示而已,我们还是要相信科学才好。
009976 这一切都令人欢欣鼓舞,阿讷西没理由不坚持到最后。
009977 直至公元前一万一千年,它又再次出现。
009978 尽量少玩电脑,少看电视,少打游戏。
009979 从五到七,前后也就是六个月的时间。
009980 一进咖啡店,他就遇见一张熟悉的脸。
009981 好在众弟兄看到了把她追了回来。
009982 有一个人说,哥们儿我们跑过它才能活。
009983 捅了她以后,模糊记得她没咋动了。
009984 从小到大,葛启义没有收到过压岁钱。
009985 舞台下的你会对舞台上的你说什么?
009986 但考生普遍认为,试题的怪多过难。
009987 我希望每个人都能够尊重我们的隐私。
009988 漫天的红霞使劲给两人增添气氛。
009989 晚上加完班开车回家,太累了,迷迷糊糊开着车,走一半的时候,铛一声!
009990 该车将三人撞倒后,在大雾中逃窜。
009991 这人一哆嗦,方向盘也把不稳了,差点撞上了高速边道护栏。
009992 那女孩儿委屈的说,我一回头见你已经进去了我不敢进去啊!
009993 小明摇摇头说,不是,我只是美女看多了,想换个口味而已。
009994 接下来,红娘要求记者交费,记者表示不知表姐身份证号码。
009995 李东蓊表示,自己当时在法庭上发表了一次独特的公诉意见。
009996 另一男子扑了上来,手里拿着明晃晃的长刀,向他胸口直刺。
009997 今天,快递员拿着一个快递在办公室喊,秦王是哪个,有他快递?
009998 这场抗议活动究竟是如何发展演变的,又究竟是谁伤害了谁?
009999 因华国锋肖鸡,墓地设计根据其属相设计。
010000 在狱中,张明宝悔恨交加,写了一份忏悔书。
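Each line of this test set pairs an utterance id with its text, which is exactly the format the new get_sentences helper in syn_utils.py (below) consumes. A minimal sketch of the parse, assuming a Chinese-language ('zh') run:

# Parsing one line of csmsc_test.txt (the zh branch of get_sentences).
line = "009901 昨日,这名伤者与医生全部被警方依法刑事拘留。"
items = line.strip().split()
utt_id = items[0]               # "009901"
sentence = "".join(items[1:])   # the sentence text, internal whitespace removed
print(utt_id, sentence)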
paddlespeech/t2s/exps/inference.py
@@ -17,13 +17,92 @@ from pathlib import Path
 import numpy
 import soundfile as sf
 from paddle import inference
-from paddlespeech.t2s.frontend import English
-from paddlespeech.t2s.frontend.zh_frontend import Frontend
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import get_sentences
 from paddlespeech.t2s.utils import str2bool
+
+
+def get_predictor(args, filed='am'):
+    full_name = ''
+    if filed == 'am':
+        full_name = args.am
+    elif filed == 'voc':
+        full_name = args.voc
+    model_name = full_name[:full_name.rindex('_')]
+    config = inference.Config(
+        str(Path(args.inference_dir) / (full_name + ".pdmodel")),
+        str(Path(args.inference_dir) / (full_name + ".pdiparams")))
+    if args.device == "gpu":
+        config.enable_use_gpu(100, 0)
+    elif args.device == "cpu":
+        config.disable_gpu()
+    # This line must be commented for fastspeech2, if not, it will OOM
+    if model_name != 'fastspeech2':
+        config.enable_memory_optim()
+    predictor = inference.create_predictor(config)
+    return predictor
+
+
-# only inference for models trained with csmsc now
-def main():
+def get_am_output(args, am_predictor, frontend, merge_sentences, input):
+    am_name = args.am[:args.am.rindex('_')]
+    am_dataset = args.am[args.am.rindex('_') + 1:]
+    am_input_names = am_predictor.get_input_names()
+    get_tone_ids = False
+    get_spk_id = False
+    if am_name == 'speedyspeech':
+        get_tone_ids = True
+    if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
+        get_spk_id = True
+        spk_id = numpy.array([args.spk_id])
+    if args.lang == 'zh':
+        input_ids = frontend.get_input_ids(
+            input, merge_sentences=merge_sentences, get_tone_ids=get_tone_ids)
+        phone_ids = input_ids["phone_ids"]
+    elif args.lang == 'en':
+        input_ids = frontend.get_input_ids(
+            input, merge_sentences=merge_sentences)
+        phone_ids = input_ids["phone_ids"]
+    else:
+        print("lang should in {'zh', 'en'}!")
+
+    if get_tone_ids:
+        tone_ids = input_ids["tone_ids"]
+        tones = tone_ids[0].numpy()
+        tones_handle = am_predictor.get_input_handle(am_input_names[1])
+        tones_handle.reshape(tones.shape)
+        tones_handle.copy_from_cpu(tones)
+    if get_spk_id:
+        spk_id_handle = am_predictor.get_input_handle(am_input_names[1])
+        spk_id_handle.reshape(spk_id.shape)
+        spk_id_handle.copy_from_cpu(spk_id)
+    phones = phone_ids[0].numpy()
+    phones_handle = am_predictor.get_input_handle(am_input_names[0])
+    phones_handle.reshape(phones.shape)
+    phones_handle.copy_from_cpu(phones)
+
+    am_predictor.run()
+    am_output_names = am_predictor.get_output_names()
+    am_output_handle = am_predictor.get_output_handle(am_output_names[0])
+    am_output_data = am_output_handle.copy_to_cpu()
+    return am_output_data
+
+
+def get_voc_output(args, voc_predictor, input):
+    voc_input_names = voc_predictor.get_input_names()
+    mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
+    mel_handle.reshape(input.shape)
+    mel_handle.copy_from_cpu(input)
+
+    voc_predictor.run()
+    voc_output_names = voc_predictor.get_output_names()
+    voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0])
+    wav = voc_output_handle.copy_to_cpu()
+    return wav
+
+
+def parse_args():
     parser = argparse.ArgumentParser(
         description="Paddle Infernce with speedyspeech & parallel wavegan.")
     # acoustic model
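For readers unfamiliar with the Paddle Inference API used above: data moves through named input/output handles rather than ordinary function arguments. A hedged sketch of the round trip; the handle calls (get_input_names, get_input_handle, copy_from_cpu, run, copy_to_cpu) are the ones the diff itself uses, while run_once and its arguments are invented for illustration:

# Sketch only: `predictor` would come from inference.create_predictor(config),
# as in get_predictor above; `data` is a numpy array.
def run_once(predictor, data):
    input_names = predictor.get_input_names()
    handle = predictor.get_input_handle(input_names[0])
    handle.reshape(data.shape)        # declare the runtime shape
    handle.copy_from_cpu(data)        # host -> engine copy
    predictor.run()
    output_names = predictor.get_output_names()
    out_handle = predictor.get_output_handle(output_names[0])
    return out_handle.copy_to_cpu()   # engine -> host copy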
@@ -70,113 +149,97 @@ def main():
     parser.add_argument(
         "--inference_dir", type=str, help="dir to save inference models")
     parser.add_argument("--output_dir", type=str, help="output dir")
     # inference
     parser.add_argument(
         "--use_trt",
         type=str2bool,
         default=False,
         help="Whether to use inference engin TensorRT.",
     )
     parser.add_argument(
         "--int8",
         type=str2bool,
         default=False,
         help="Whether to use int8 inference.",
     )
     parser.add_argument(
         "--fp16",
         type=str2bool,
         default=False,
         help="Whether to use float16 inference.",
     )
+    parser.add_argument(
+        "--device",
+        default="gpu",
+        choices=["gpu", "cpu"],
+        help="Device selected for inference.",
+    )
     args, _ = parser.parse_known_args()
+    return args
+
+
+# only inference for models trained with csmsc now
+def main():
+    args = parse_args()

     # frontend
-    if args.lang == 'zh':
-        frontend = Frontend(
-            phone_vocab_path=args.phones_dict,
-            tone_vocab_path=args.tones_dict)
-    elif args.lang == 'en':
-        frontend = English(phone_vocab_path=args.phones_dict)
-    print("frontend done!")
+    frontend = get_frontend(args)

+    # am_predictor
+    am_predictor = get_predictor(args, filed='am')
     # model: {model_name}_{dataset}
     am_name = args.am[:args.am.rindex('_')]
     am_dataset = args.am[args.am.rindex('_') + 1:]

-    am_config = inference.Config(
-        str(Path(args.inference_dir) / (args.am + ".pdmodel")),
-        str(Path(args.inference_dir) / (args.am + ".pdiparams")))
-    am_config.enable_use_gpu(100, 0)
-    # This line must be commented for fastspeech2, if not, it will OOM
-    if am_name != 'fastspeech2':
-        am_config.enable_memory_optim()
-    am_predictor = inference.create_predictor(am_config)
-
-    voc_config = inference.Config(
-        str(Path(args.inference_dir) / (args.voc + ".pdmodel")),
-        str(Path(args.inference_dir) / (args.voc + ".pdiparams")))
-    voc_config.enable_use_gpu(100, 0)
-    voc_config.enable_memory_optim()
-    voc_predictor = inference.create_predictor(voc_config)
+    # voc_predictor
+    voc_predictor = get_predictor(args, filed='voc')

     output_dir = Path(args.output_dir)
     output_dir.mkdir(parents=True, exist_ok=True)

-    sentences = []
-    print("in new inference")
-
-    # construct dataset for evaluation
-    sentences = []
-    with open(args.text, 'rt') as f:
-        for line in f:
-            items = line.strip().split()
-            utt_id = items[0]
-            if args.lang == 'zh':
-                sentence = "".join(items[1:])
-            elif args.lang == 'en':
-                sentence = " ".join(items[1:])
-            sentences.append((utt_id, sentence))
-
-    get_tone_ids = False
-    get_spk_id = False
-    if am_name == 'speedyspeech':
-        get_tone_ids = True
-    if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
-        get_spk_id = True
-        spk_id = numpy.array([args.spk_id])
+    sentences = get_sentences(args)

     am_input_names = am_predictor.get_input_names()
     print("am_input_names:", am_input_names)
     merge_sentences = True

+    fs = 24000 if am_dataset != 'ljspeech' else 22050
+    # warmup
+    for utt_id, sentence in sentences[:3]:
+        with timer() as t:
+            am_output_data = get_am_output(
+                args,
+                am_predictor=am_predictor,
+                frontend=frontend,
+                merge_sentences=merge_sentences,
+                input=sentence)
+            wav = get_voc_output(
+                args, voc_predictor=voc_predictor, input=am_output_data)
+        speed = wav.size / t.elapse
+        rtf = fs / speed
+        print(
+            f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}.")
+    print("warm up done!")
+
+    N = 0
+    T = 0
     for utt_id, sentence in sentences:
-        if args.lang == 'zh':
-            input_ids = frontend.get_input_ids(
-                sentence,
-                merge_sentences=merge_sentences,
-                get_tone_ids=get_tone_ids)
-            phone_ids = input_ids["phone_ids"]
-        elif args.lang == 'en':
-            input_ids = frontend.get_input_ids(
-                sentence, merge_sentences=merge_sentences)
-            phone_ids = input_ids["phone_ids"]
-        else:
-            print("lang should in {'zh', 'en'}!")
-        if get_tone_ids:
-            tone_ids = input_ids["tone_ids"]
-            tones = tone_ids[0].numpy()
-            tones_handle = am_predictor.get_input_handle(am_input_names[1])
-            tones_handle.reshape(tones.shape)
-            tones_handle.copy_from_cpu(tones)
-        if get_spk_id:
-            spk_id_handle = am_predictor.get_input_handle(am_input_names[1])
-            spk_id_handle.reshape(spk_id.shape)
-            spk_id_handle.copy_from_cpu(spk_id)
-        phones = phone_ids[0].numpy()
-        phones_handle = am_predictor.get_input_handle(am_input_names[0])
-        phones_handle.reshape(phones.shape)
-        phones_handle.copy_from_cpu(phones)
-
-        am_predictor.run()
-        am_output_names = am_predictor.get_output_names()
-        am_output_handle = am_predictor.get_output_handle(am_output_names[0])
-        am_output_data = am_output_handle.copy_to_cpu()
-
-        voc_input_names = voc_predictor.get_input_names()
-        mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
-        mel_handle.reshape(am_output_data.shape)
-        mel_handle.copy_from_cpu(am_output_data)
-
-        voc_predictor.run()
-        voc_output_names = voc_predictor.get_output_names()
-        voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0])
-        wav = voc_output_handle.copy_to_cpu()
+        with timer() as t:
+            am_output_data = get_am_output(
+                args,
+                am_predictor=am_predictor,
+                frontend=frontend,
+                merge_sentences=merge_sentences,
+                input=sentence)
+            wav = get_voc_output(
+                args, voc_predictor=voc_predictor, input=am_output_data)
+
+        N += wav.size
+        T += t.elapse
+        speed = wav.size / t.elapse
+        rtf = fs / speed

         sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)
+        print(
+            f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}.")
         print(f"{utt_id} done!")
+    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")


 if __name__ == "__main__":
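As a sanity check on the timing math above: RTF (real-time factor) is synthesis time divided by audio duration, so values below 1 mean faster than real time. A quick calculation with made-up numbers:

# Illustrative values only.
fs = 24000                  # output sample rate (Hz)
wav_size = 72000            # samples synthesized, i.e. 3 s of audio
elapse = 0.6                # wall-clock seconds spent

speed = wav_size / elapse   # 120000 samples generated per second
rtf = fs / speed            # 0.2: each second of audio costs 0.2 s to make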
paddlespeech/t2s/exps/syn_utils.py (new file, 0 → 100644)
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os

import numpy as np
import paddle
from paddle import jit
from paddle.static import InputSpec

from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore

model_alias = {
    # acoustic model
    "speedyspeech": "paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
    "speedyspeech_inference":
    "paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
    "fastspeech2": "paddlespeech.t2s.models.fastspeech2:FastSpeech2",
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    "tacotron2": "paddlespeech.t2s.models.tacotron2:Tacotron2",
    "tacotron2_inference":
    "paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
    # voc
    "pwgan": "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
    "pwgan_inference": "paddlespeech.t2s.models.parallel_wavegan:PWGInference",
    "mb_melgan": "paddlespeech.t2s.models.melgan:MelGANGenerator",
    "mb_melgan_inference": "paddlespeech.t2s.models.melgan:MelGANInference",
    "style_melgan": "paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
    "style_melgan_inference":
    "paddlespeech.t2s.models.melgan:StyleMelGANInference",
    "hifigan": "paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
    "hifigan_inference": "paddlespeech.t2s.models.hifigan:HiFiGANInference",
    "wavernn": "paddlespeech.t2s.models.wavernn:WaveRNN",
    "wavernn_inference": "paddlespeech.t2s.models.wavernn:WaveRNNInference",
}


# input
def get_sentences(args):
    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            if 'lang' in args and args.lang == 'zh':
                sentence = "".join(items[1:])
            elif 'lang' in args and args.lang == 'en':
                sentence = " ".join(items[1:])
            sentences.append((utt_id, sentence))
    return sentences


def get_test_dataset(args, test_metadata, am_name, am_dataset):
    if am_name == 'fastspeech2':
        fields = ["utt_id", "text"]
        if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
            print("multiple speaker fastspeech2!")
            fields += ["spk_id"]
        elif 'voice_cloning' in args and args.voice_cloning:
            print("voice cloning!")
            fields += ["spk_emb"]
        else:
            print("single speaker fastspeech2!")
    elif am_name == 'speedyspeech':
        fields = ["utt_id", "phones", "tones"]
    elif am_name == 'tacotron2':
        fields = ["utt_id", "text"]
        if 'voice_cloning' in args and args.voice_cloning:
            print("voice cloning!")
            fields += ["spk_emb"]

    test_dataset = DataTable(data=test_metadata, fields=fields)
    return test_dataset


# frontend
def get_frontend(args):
    if 'lang' in args and args.lang == 'zh':
        frontend = Frontend(
            phone_vocab_path=args.phones_dict,
            tone_vocab_path=args.tones_dict)
    elif 'lang' in args and args.lang == 'en':
        frontend = English(phone_vocab_path=args.phones_dict)
    else:
        print("wrong lang!")
    print("frontend done!")
    return frontend


# dygraph
def get_am_inference(args, am_config):
    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    tone_size = None
    if 'tones_dict' in args and args.tones_dict:
        with open(args.tones_dict, "r") as f:
            tone_id = [line.strip().split() for line in f.readlines()]
        tone_size = len(tone_id)
        print("tone_size:", tone_size)

    spk_num = None
    if 'speaker_dict' in args and args.speaker_dict:
        with open(args.speaker_dict, 'rt') as f:
            spk_id = [line.strip().split() for line in f.readlines()]
        spk_num = len(spk_id)
        print("spk_num:", spk_num)

    odim = am_config.n_mels
    # model: {model_name}_{dataset}
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]

    am_class = dynamic_import(am_name, model_alias)
    am_inference_class = dynamic_import(am_name + '_inference', model_alias)

    if am_name == 'fastspeech2':
        am = am_class(
            idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
    elif am_name == 'speedyspeech':
        am = am_class(
            vocab_size=vocab_size,
            tone_size=tone_size,
            spk_num=spk_num,
            **am_config["model"])
    elif am_name == 'tacotron2':
        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])

    am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
    am.eval()
    am_mu, am_std = np.load(args.am_stat)
    am_mu = paddle.to_tensor(am_mu)
    am_std = paddle.to_tensor(am_std)
    am_normalizer = ZScore(am_mu, am_std)
    am_inference = am_inference_class(am_normalizer, am)
    am_inference.eval()
    print("acoustic model done!")
    return am_inference, am_name, am_dataset


def get_voc_inference(args, voc_config):
    # model: {model_name}_{dataset}
    voc_name = args.voc[:args.voc.rindex('_')]
    voc_class = dynamic_import(voc_name, model_alias)
    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
    if voc_name != 'wavernn':
        voc = voc_class(**voc_config["generator_params"])
        voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
        voc.remove_weight_norm()
        voc.eval()
    else:
        voc = voc_class(**voc_config["model"])
        voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
        voc.eval()
    voc_mu, voc_std = np.load(args.voc_stat)
    voc_mu = paddle.to_tensor(voc_mu)
    voc_std = paddle.to_tensor(voc_std)
    voc_normalizer = ZScore(voc_mu, voc_std)
    voc_inference = voc_inference_class(voc_normalizer, voc)
    voc_inference.eval()
    print("voc done!")
    return voc_inference


# to static
def am_to_static(args, am_inference, am_name, am_dataset):
    if am_name == 'fastspeech2':
        if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
            am_inference = jit.to_static(
                am_inference,
                input_spec=[
                    InputSpec([-1], dtype=paddle.int64),
                    InputSpec([1], dtype=paddle.int64),
                ],
            )
        else:
            am_inference = jit.to_static(
                am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
    elif am_name == 'speedyspeech':
        if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
            am_inference = jit.to_static(
                am_inference,
                input_spec=[
                    InputSpec([-1], dtype=paddle.int64),  # text
                    InputSpec([-1], dtype=paddle.int64),  # tone
                    InputSpec([1], dtype=paddle.int64),   # spk_id
                    None                                  # duration
                ])
        else:
            am_inference = jit.to_static(
                am_inference,
                input_spec=[
                    InputSpec([-1], dtype=paddle.int64),
                    InputSpec([-1], dtype=paddle.int64)
                ])
    elif am_name == 'tacotron2':
        am_inference = jit.to_static(
            am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])

    paddle.jit.save(am_inference, os.path.join(args.inference_dir, args.am))
    am_inference = paddle.jit.load(os.path.join(args.inference_dir, args.am))
    return am_inference


def voc_to_static(args, voc_inference):
    voc_inference = jit.to_static(
        voc_inference, input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(voc_inference, os.path.join(args.inference_dir, args.voc))
    voc_inference = paddle.jit.load(os.path.join(args.inference_dir, args.voc))
    return voc_inference
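model_alias maps short names to "module:object" paths that dynamic_import resolves at runtime. The real helper lives in paddlespeech.s2t.utils.dynamic_import; the following is only a minimal sketch of the idea, not that implementation:

import importlib

def resolve(name, alias_map):
    # e.g. resolve("fastspeech2", model_alias) would yield the FastSpeech2 class
    path = alias_map.get(name, name)        # "pkg.module:Object"
    module_name, obj_name = path.split(":")
    module = importlib.import_module(module_name)
    return getattr(module, obj_name)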
paddlespeech/t2s/exps/synthesize.py
@@ -23,48 +23,11 @@ import yaml
 from timer import timer
 from yacs.config import CfgNode

-from paddlespeech.s2t.utils.dynamic_import import dynamic_import
-from paddlespeech.t2s.datasets.data_table import DataTable
-from paddlespeech.t2s.modules.normalizer import ZScore
+from paddlespeech.t2s.exps.syn_utils import get_am_inference
+from paddlespeech.t2s.exps.syn_utils import get_test_dataset
+from paddlespeech.t2s.exps.syn_utils import get_voc_inference
 from paddlespeech.t2s.utils import str2bool

-model_alias = {
-    # acoustic model
-    "speedyspeech": "paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
-    ...  # (full registry, from "speedyspeech" through "wavernn_inference";
-    ...  #  identical to the copy now living in syn_utils.py above)
-}

 def evaluate(args):
     # dataloader has been too verbose

@@ -86,96 +49,12 @@ def evaluate(args):
     print(am_config)
     print(voc_config)

-    # construct dataset for evaluation
-    # model: {model_name}_{dataset}
-    am_name = args.am[:args.am.rindex('_')]
-    am_dataset = args.am[args.am.rindex('_') + 1:]
-
-    if am_name == 'fastspeech2':
-        fields = ["utt_id", "text"]
-        spk_num = None
-        if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
-            print("multiple speaker fastspeech2!")
-            with open(args.speaker_dict, 'rt') as f:
-                spk_id = [line.strip().split() for line in f.readlines()]
-            spk_num = len(spk_id)
-            fields += ["spk_id"]
-        elif args.voice_cloning:
-            print("voice cloning!")
-            fields += ["spk_emb"]
-        else:
-            print("single speaker fastspeech2!")
-        print("spk_num:", spk_num)
-    elif am_name == 'speedyspeech':
-        fields = ["utt_id", "phones", "tones"]
-    elif am_name == 'tacotron2':
-        fields = ["utt_id", "text"]
-        if args.voice_cloning:
-            print("voice cloning!")
-            fields += ["spk_emb"]
-
-    test_dataset = DataTable(data=test_metadata, fields=fields)
-
-    with open(args.phones_dict, "r") as f:
-        phn_id = [line.strip().split() for line in f.readlines()]
-    vocab_size = len(phn_id)
-    print("vocab_size:", vocab_size)
-
-    tone_size = None
-    if args.tones_dict:
-        with open(args.tones_dict, "r") as f:
-            tone_id = [line.strip().split() for line in f.readlines()]
-        tone_size = len(tone_id)
-        print("tone_size:", tone_size)
-
-    # acoustic model
-    odim = am_config.n_mels
-    am_class = dynamic_import(am_name, model_alias)
-    am_inference_class = dynamic_import(am_name + '_inference', model_alias)
-
-    if am_name == 'fastspeech2':
-        am = am_class(
-            idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
-    elif am_name == 'speedyspeech':
-        am = am_class(
-            vocab_size=vocab_size, tone_size=tone_size, **am_config["model"])
-    elif am_name == 'tacotron2':
-        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
-
-    am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
-    am.eval()
-    am_mu, am_std = np.load(args.am_stat)
-    am_mu = paddle.to_tensor(am_mu)
-    am_std = paddle.to_tensor(am_std)
-    am_normalizer = ZScore(am_mu, am_std)
-    am_inference = am_inference_class(am_normalizer, am)
-    print("am_inference.training0:", am_inference.training)
-    am_inference.eval()
-    print("acoustic model done!")
+    am_inference, am_name, am_dataset = get_am_inference(args, am_config)
+    test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset)

     # vocoder
-    # model: {model_name}_{dataset}
-    voc_name = args.voc[:args.voc.rindex('_')]
-    voc_class = dynamic_import(voc_name, model_alias)
-    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
-    if voc_name != 'wavernn':
-        voc = voc_class(**voc_config["generator_params"])
-        voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
-        voc.remove_weight_norm()
-        voc.eval()
-    else:
-        voc = voc_class(**voc_config["model"])
-        voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
-        voc.eval()
-    voc_mu, voc_std = np.load(args.voc_stat)
-    voc_mu = paddle.to_tensor(voc_mu)
-    voc_std = paddle.to_tensor(voc_std)
-    voc_normalizer = ZScore(voc_mu, voc_std)
-    voc_inference = voc_inference_class(voc_normalizer, voc)
-    print("voc_inference.training0:", voc_inference.training)
-    voc_inference.eval()
-    print("voc done!")
+    voc_inference = get_voc_inference(args, voc_config)

     output_dir = Path(args.output_dir)
     output_dir.mkdir(parents=True, exist_ok=True)

@@ -227,7 +106,7 @@ def evaluate(args):
     print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")


-def main():
+def parse_args():
     # parse args and config and redirect to train_sp
     parser = argparse.ArgumentParser(
         description="Synthesize with acoustic model & vocoder")

@@ -264,7 +143,6 @@ def main():
         "--tones_dict", type=str, default=None, help="tone vocabulary file.")
     parser.add_argument(
         "--speaker_dict", type=str, default=None, help="speaker id map file.")
-
     parser.add_argument(
         "--voice-cloning",
         type=str2bool,

@@ -281,7 +159,6 @@ def main():
             'style_melgan_csmsc'
         ],
         help='Choose vocoder type of tts task.')
-
     parser.add_argument(
         '--voc_config',
         type=str,

@@ -302,7 +179,12 @@ def main():
     parser.add_argument("--output_dir", type=str, help="output dir.")

     args = parser.parse_args()
+    return args
+
+
+def main():
+    args = parse_args()

     if args.ngpu == 0:
         paddle.set_device("cpu")
     elif args.ngpu > 0:
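Both get_am_inference and get_voc_inference wrap the raw network in a ZScore de-normalizer built from the training statistics loaded via np.load(args.am_stat) / np.load(args.voc_stat). A hedged sketch of that convention, assuming the usual z-score definition (values below are illustrative):

import paddle

# Per-dimension mean/std of the training mels; real values come from *.npy stats.
mu = paddle.to_tensor([0.0, 1.0])
std = paddle.to_tensor([1.0, 2.0])

normalized = paddle.to_tensor([0.5, 0.5])   # the scale the network predicts in
restored = normalized * std + mu            # back to the raw mel scale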
paddlespeech/t2s/exps/synthesize_e2e.py
@@ -12,59 +12,20 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
-import os
 from pathlib import Path

-import numpy as np
 import paddle
 import soundfile as sf
 import yaml
-from paddle import jit
-from paddle.static import InputSpec
 from timer import timer
 from yacs.config import CfgNode

-from paddlespeech.s2t.utils.dynamic_import import dynamic_import
-from paddlespeech.t2s.frontend import English
-from paddlespeech.t2s.frontend.zh_frontend import Frontend
-from paddlespeech.t2s.modules.normalizer import ZScore
-
-model_alias = {
-    # acoustic model
-    "speedyspeech": "paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
-    ...  # (full registry through "wavernn_inference"; moved to syn_utils.py)
-}
+from paddlespeech.t2s.exps.syn_utils import am_to_static
+from paddlespeech.t2s.exps.syn_utils import get_am_inference
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import get_sentences
+from paddlespeech.t2s.exps.syn_utils import get_voc_inference
+from paddlespeech.t2s.exps.syn_utils import voc_to_static


 def evaluate(args):

@@ -81,155 +42,28 @@ def evaluate(args):
     print(am_config)
     print(voc_config)

-    # construct dataset for evaluation
-    sentences = []
-    with open(args.text, 'rt') as f:
-        for line in f:
-            items = line.strip().split()
-            utt_id = items[0]
-            if args.lang == 'zh':
-                sentence = "".join(items[1:])
-            elif args.lang == 'en':
-                sentence = " ".join(items[1:])
-            sentences.append((utt_id, sentence))
-
-    with open(args.phones_dict, "r") as f:
-        phn_id = [line.strip().split() for line in f.readlines()]
-    vocab_size = len(phn_id)
-    print("vocab_size:", vocab_size)
-
-    tone_size = None
-    if args.tones_dict:
-        with open(args.tones_dict, "r") as f:
-            tone_id = [line.strip().split() for line in f.readlines()]
-        tone_size = len(tone_id)
-        print("tone_size:", tone_size)
-
-    spk_num = None
-    if args.speaker_dict:
-        with open(args.speaker_dict, 'rt') as f:
-            spk_id = [line.strip().split() for line in f.readlines()]
-        spk_num = len(spk_id)
-        print("spk_num:", spk_num)
+    sentences = get_sentences(args)

     # frontend
-    if args.lang == 'zh':
-        frontend = Frontend(
-            phone_vocab_path=args.phones_dict,
-            tone_vocab_path=args.tones_dict)
-    elif args.lang == 'en':
-        frontend = English(phone_vocab_path=args.phones_dict)
-    print("frontend done!")
+    frontend = get_frontend(args)

     # acoustic model
-    odim = am_config.n_mels
-    # model: {model_name}_{dataset}
-    am_name = args.am[:args.am.rindex('_')]
-    am_dataset = args.am[args.am.rindex('_') + 1:]
-
-    am_class = dynamic_import(am_name, model_alias)
-    am_inference_class = dynamic_import(am_name + '_inference', model_alias)
-
-    if am_name == 'fastspeech2':
-        am = am_class(
-            idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
-    elif am_name == 'speedyspeech':
-        am = am_class(
-            vocab_size=vocab_size,
-            tone_size=tone_size,
-            spk_num=spk_num,
-            **am_config["model"])
-    elif am_name == 'tacotron2':
-        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
-
-    am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
-    am.eval()
-    am_mu, am_std = np.load(args.am_stat)
-    am_mu = paddle.to_tensor(am_mu)
-    am_std = paddle.to_tensor(am_std)
-    am_normalizer = ZScore(am_mu, am_std)
-    am_inference = am_inference_class(am_normalizer, am)
-    am_inference.eval()
-    print("acoustic model done!")
+    am_inference, am_name, am_dataset = get_am_inference(args, am_config)

     # vocoder
-    # model: {model_name}_{dataset}
-    voc_name = args.voc[:args.voc.rindex('_')]
-    voc_class = dynamic_import(voc_name, model_alias)
-    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
-    if voc_name != 'wavernn':
-        voc = voc_class(**voc_config["generator_params"])
-        voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
-        voc.remove_weight_norm()
-        voc.eval()
-    else:
-        voc = voc_class(**voc_config["model"])
-        voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
-        voc.eval()
-    voc_mu, voc_std = np.load(args.voc_stat)
-    voc_mu = paddle.to_tensor(voc_mu)
-    voc_std = paddle.to_tensor(voc_std)
-    voc_normalizer = ZScore(voc_mu, voc_std)
-    voc_inference = voc_inference_class(voc_normalizer, voc)
-    voc_inference.eval()
-    print("voc done!")
+    voc_inference = get_voc_inference(args, voc_config)

     # whether dygraph to static
     if args.inference_dir:
         # acoustic model
-        if am_name == 'fastspeech2':
-            if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
-                am_inference = jit.to_static(
-                    am_inference,
-                    input_spec=[
-                        InputSpec([-1], dtype=paddle.int64),
-                        InputSpec([1], dtype=paddle.int64)
-                    ])
-            else:
-                am_inference = jit.to_static(
-                    am_inference,
-                    input_spec=[InputSpec([-1], dtype=paddle.int64)])
-        elif am_name == 'speedyspeech':
-            ...  # (text/tone/spk_id/duration InputSpec branches, identical to
-            ...  #  the am_to_static version now in syn_utils.py)
-        elif am_name == 'tacotron2':
-            am_inference = jit.to_static(
-                am_inference,
-                input_spec=[InputSpec([-1], dtype=paddle.int64)])
-        paddle.jit.save(am_inference, os.path.join(args.inference_dir, args.am))
-        am_inference = paddle.jit.load(
-            os.path.join(args.inference_dir, args.am))
+        am_inference = am_to_static(args, am_inference, am_name, am_dataset)

         # vocoder
-        voc_inference = jit.to_static(
-            voc_inference,
-            input_spec=[
-                InputSpec([-1, 80], dtype=paddle.float32),
-            ])
-        paddle.jit.save(voc_inference,
-                        os.path.join(args.inference_dir, args.voc))
-        voc_inference = paddle.jit.load(
-            os.path.join(args.inference_dir, args.voc))
+        voc_inference = voc_to_static(args, voc_inference)

     output_dir = Path(args.output_dir)
     output_dir.mkdir(parents=True, exist_ok=True)

-    merge_sentences = False
+    merge_sentences = True
     # Avoid not stopping at the end of a sub sentence when tacotron2_ljspeech dygraph to static graph
     # but still not stopping in the end (NOTE by yuantian01 Feb 9 2022)
     if am_name == 'tacotron2':

@@ -298,7 +132,7 @@ def evaluate(args):
     print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")


-def main():
+def parse_args():
     # parse args and config and redirect to train_sp
     parser = argparse.ArgumentParser(
         description="Synthesize with acoustic model & vocoder")

@@ -351,7 +185,6 @@ def main():
             'wavernn_csmsc'
         ],
         help='Choose vocoder type of tts task.')
-
     parser.add_argument(
         '--voc_config',
         type=str,

@@ -386,6 +219,11 @@ def main():
     parser.add_argument("--output_dir", type=str, help="output dir.")

     args = parser.parse_args()
+    return args
+
+
+def main():
+    args = parse_args()

     if args.ngpu == 0:
         paddle.set_device("cpu")
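The dygraph-to-static export that this file now delegates to am_to_static/voc_to_static uses paddle.jit with InputSpec placeholders for variable-length inputs ([-1] means "any length"). A minimal hedged sketch of that export/reload round trip, with a toy layer and an illustrative path:

import os
import paddle
from paddle import jit
from paddle.static import InputSpec

class Toy(paddle.nn.Layer):
    def forward(self, x):
        return x * 2

layer = Toy()
# Trace to a static graph; [-1, 80] mirrors the mel-spectrogram spec above.
layer = jit.to_static(
    layer, input_spec=[InputSpec([-1, 80], dtype=paddle.float32)])
path = os.path.join("/tmp", "toy_static")  # illustrative location
paddle.jit.save(layer, path)   # writes toy_static.pdmodel / .pdiparams
layer = paddle.jit.load(path)  # reload as a static-graph callable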
paddlespeech/t2s/exps/voice_cloning.py
@@ -21,29 +21,12 @@ import soundfile as sf
 import yaml
 from yacs.config import CfgNode

-from paddlespeech.s2t.utils.dynamic_import import dynamic_import
+from paddlespeech.t2s.exps.syn_utils import get_am_inference
+from paddlespeech.t2s.exps.syn_utils import get_voc_inference
 from paddlespeech.t2s.frontend.zh_frontend import Frontend
-from paddlespeech.t2s.modules.normalizer import ZScore
 from paddlespeech.vector.exps.ge2e.audio_processor import SpeakerVerificationPreprocessor
 from paddlespeech.vector.models.lstm_speaker_encoder import LSTMSpeakerEncoder

-model_alias = {
-    # acoustic model
-    "fastspeech2": "paddlespeech.t2s.models.fastspeech2:FastSpeech2",
-    "fastspeech2_inference":
-    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
-    "tacotron2": "paddlespeech.t2s.models.tacotron2:Tacotron2",
-    "tacotron2_inference":
-    "paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
-    # voc
-    "pwgan": "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
-    "pwgan_inference": "paddlespeech.t2s.models.parallel_wavegan:PWGInference",
-}

 def voice_cloning(args):
     # Init body.

@@ -79,55 +62,14 @@ def voice_cloning(args):
     speaker_encoder.eval()
     print("GE2E Done!")

-    with open(args.phones_dict, "r") as f:
-        phn_id = [line.strip().split() for line in f.readlines()]
-    vocab_size = len(phn_id)
-    print("vocab_size:", vocab_size)
+    frontend = Frontend(phone_vocab_path=args.phones_dict)
+    print("frontend done!")

     # acoustic model
-    odim = am_config.n_mels
-    # model: {model_name}_{dataset}
-    am_name = args.am[:args.am.rindex('_')]
-    am_dataset = args.am[args.am.rindex('_') + 1:]
-    am_class = dynamic_import(am_name, model_alias)
-    am_inference_class = dynamic_import(am_name + '_inference', model_alias)
-
-    if am_name == 'fastspeech2':
-        am = am_class(
-            idim=vocab_size, odim=odim, spk_num=None, **am_config["model"])
-    elif am_name == 'tacotron2':
-        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
-
-    am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
-    am.eval()
-    am_mu, am_std = np.load(args.am_stat)
-    am_mu = paddle.to_tensor(am_mu)
-    am_std = paddle.to_tensor(am_std)
-    am_normalizer = ZScore(am_mu, am_std)
-    am_inference = am_inference_class(am_normalizer, am)
-    am_inference.eval()
-    print("acoustic model done!")
+    am_inference, *_ = get_am_inference(args, am_config)

     # vocoder
-    # model: {model_name}_{dataset}
-    voc_name = args.voc[:args.voc.rindex('_')]
-    voc_class = dynamic_import(voc_name, model_alias)
-    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
-    voc = voc_class(**voc_config["generator_params"])
-    voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
-    voc.remove_weight_norm()
-    voc.eval()
-    voc_mu, voc_std = np.load(args.voc_stat)
-    voc_mu = paddle.to_tensor(voc_mu)
-    voc_std = paddle.to_tensor(voc_std)
-    voc_normalizer = ZScore(voc_mu, voc_std)
-    voc_inference = voc_inference_class(voc_normalizer, voc)
-    voc_inference.eval()
-    print("voc done!")
-
-    frontend = Frontend(phone_vocab_path=args.phones_dict)
-    print("frontend done!")
+    voc_inference = get_voc_inference(args, voc_config)

     output_dir = Path(args.output_dir)
     output_dir.mkdir(parents=True, exist_ok=True)

@@ -170,7 +112,7 @@ def voice_cloning(args):
     print(f"{utt_id} done!")


-def main():
+def parse_args():
     # parse args and config and redirect to train_sp
     parser = argparse.ArgumentParser(description="")
     parser.add_argument(

@@ -240,6 +182,11 @@ def main():
     parser.add_argument("--output-dir", type=str, help="output dir.")

     args = parser.parse_args()
+    return args
+
+
+def main():
+    args = parse_args()

     if args.ngpu == 0:
         paddle.set_device("cpu")
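One small idiom worth noting in the hunk above: get_am_inference now returns a 3-tuple (am_inference, am_name, am_dataset), and voice cloning only needs the model, so the remaining items are discarded with starred unpacking. A trivial illustration with made-up values:

def get_am_inference_stub():
    # stand-in for the real helper's (am_inference, am_name, am_dataset) tuple
    return "model", "fastspeech2", "aishell3"

model, *_ = get_am_inference_stub()  # keep the first item, drop the rest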
paddlespeech/t2s/modules/predictor/length_regulator.py

@@ -101,6 +101,16 @@ class LengthRegulator(nn.Layer):
             assert alpha > 0
             ds = paddle.round(ds.cast(dtype=paddle.float32) * alpha)
         ds = ds.cast(dtype=paddle.int64)
+        '''
+        from distutils.version import LooseVersion
+        from paddlespeech.t2s.modules.nets_utils import pad_list
+        # dygraph-to-static conversion does not work here on paddle 2.2.2
+        # if LooseVersion(paddle.__version__) >= "2.3.0" or hasattr(paddle, 'repeat_interleave'):
+        # if LooseVersion(paddle.__version__) >= "2.3.0":
+        if hasattr(paddle, 'repeat_interleave'):
+            repeat = [paddle.repeat_interleave(x, d, axis=0) for x, d in zip(xs, ds)]
+            return pad_list(repeat, self.pad_value)
+        '''
         if is_inference:
             return self.expand(xs, ds)
         else:
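The commented-out block above is a vectorized alternative to LengthRegulator.expand: each input frame is repeated according to its predicted duration. A small illustration of that expansion (values made up; requires a Paddle new enough to provide repeat_interleave, which is exactly what the hasattr guard checks):

import paddle

# Two "phone" frames with durations 2 and 3 expand to five output frames.
x = paddle.to_tensor([[1.0], [2.0]])
d = paddle.to_tensor([2, 3])
out = paddle.repeat_interleave(x, d, axis=0)
print(out.numpy().ravel())  # [1. 1. 2. 2. 2.]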