PaddlePaddle / DeepSpeech

Commit 5503c8bd
Authored July 12, 2022 by 小湉湉
add ernie_sat synthesize script for metadata.jsonl, test=tts
Parent: 028742b6

Showing 11 changed files with 357 additions and 28 deletions (+357, −28)
examples/aishell3/ernie_sat/conf/default.yaml        (+6, −5)
examples/aishell3_vctk/ernie_sat/conf/default.yaml   (+6, −5)
examples/aishell3_vctk/ernie_sat/local/preprocess.sh (+32, −10)
examples/aishell3_vctk/ernie_sat/path.sh             (+1, −1)
examples/vctk/ernie_sat/local/synthesize.sh          (+45, −1)
paddlespeech/t2s/datasets/am_batch_fn.py             (+9, −1)
paddlespeech/t2s/exps/ernie_sat/preprocess.py        (+2, −1)
paddlespeech/t2s/exps/ernie_sat/synthesize.py        (+201, −0)
paddlespeech/t2s/exps/ernie_sat/train.py             (+0, −2)
paddlespeech/t2s/exps/syn_utils.py                   (+19, −1)
paddlespeech/t2s/models/ernie_sat/ernie_sat.py       (+36, −1)
examples/aishell3/ernie_sat/conf/default.yaml

@@ -21,7 +21,7 @@ mlm_prob: 0.8
 ###########################################################
 #                       DATA SETTING                      #
 ###########################################################
-batch_size: 64
+batch_size: 20
 num_workers: 2
 ###########################################################
@@ -71,14 +71,15 @@ model:
 ###########################################################
 #                    OPTIMIZER SETTING                    #
 ###########################################################
-optimizer:
-    optim: adam              # optimizer type
-    learning_rate: 0.001     # learning rate
+scheduler_params:
+    d_model: 384
+    warmup_steps: 4000
+grad_clip: 1.0
 ###########################################################
 #                     TRAINING SETTING                    #
 ###########################################################
-max_epoch: 200
+max_epoch: 600
 num_snapshots: 5
 ###########################################################
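The optimizer change above drops the fixed adam learning rate in favor of `scheduler_params` with `d_model` and `warmup_steps`, which matches the Noam (transformer) warmup schedule. A minimal sketch of that schedule, assuming that is indeed the scheduler these parameters feed (the config fragment alone does not name it):

```python
def noam_lr(step: int, d_model: int = 384, warmup_steps: int = 4000) -> float:
    """Noam learning-rate schedule: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks exactly at `warmup_steps` and decays as `step ** -0.5` afterwards, which is why no explicit `learning_rate` remains in the config.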
examples/aishell3_vctk/ernie_sat/conf/default.yaml

@@ -21,7 +21,7 @@ mlm_prob: 0.8
 ###########################################################
 #                       DATA SETTING                      #
 ###########################################################
-batch_size: 64
+batch_size: 20
 num_workers: 2
 ###########################################################
@@ -71,14 +71,15 @@ model:
 ###########################################################
 #                    OPTIMIZER SETTING                    #
 ###########################################################
-optimizer:
-    optim: adam              # optimizer type
-    learning_rate: 0.001     # learning rate
+scheduler_params:
+    d_model: 384
+    warmup_steps: 4000
+grad_clip: 1.0
 ###########################################################
 #                     TRAINING SETTING                    #
 ###########################################################
-max_epoch: 100
+max_epoch: 300
 num_snapshots: 5
 ###########################################################
examples/aishell3_vctk/ernie_sat/local/preprocess.sh

@@ -7,14 +7,29 @@ config_path=$1
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     # get durations from MFA's result
-    echo "Generate durations.txt from MFA results ..."
+    echo "Generate durations.txt from MFA results for aishell3 ..."
     python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
         --inputdir=./aishell3_alignment_tone \
-        --output durations.txt \
+        --output durations_aishell3.txt \
         --config=${config_path}
 fi

+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    # get durations from MFA's result
+    echo "Generate durations.txt from MFA results for vctk ..."
+    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
+        --inputdir=./vctk_alignment \
+        --output durations_vctk.txt \
+        --config=${config_path}
+fi
+
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    # get durations from MFA's result
+    echo "concat durations_aishell3.txt and durations_vctk.txt to durations.txt"
+    cat durations_aishell3.txt durations_vctk.txt > durations.txt
+fi
+
-if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
     # extract features
     echo "Extract features ..."
     python3 ${BIN_DIR}/preprocess.py \
@@ -27,7 +42,20 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
         --cut-sil=True
 fi

-if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+    # extract features
+    echo "Extract features ..."
+    python3 ${BIN_DIR}/preprocess.py \
+        --dataset=vctk \
+        --rootdir=~/datasets/VCTK-Corpus-0.92/ \
+        --dumpdir=dump \
+        --dur-file=durations.txt \
+        --config=${config_path} \
+        --num-cpu=20 \
+        --cut-sil=True
+fi
+
+if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
     # get features' stats(mean and std)
     echo "Get features' stats ..."
     python3 ${MAIN_ROOT}/utils/compute_statistics.py \
@@ -35,15 +63,13 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
         --field-name="speech"
 fi

-if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
     # normalize and covert phone/speaker to id, dev and test should use train's stats
     echo "Normalize ..."
     python3 ${BIN_DIR}/normalize.py \
         --metadata=dump/train/raw/metadata.jsonl \
         --dumpdir=dump/train/norm \
         --speech-stats=dump/train/speech_stats.npy \
-        --pitch-stats=dump/train/pitch_stats.npy \
-        --energy-stats=dump/train/energy_stats.npy \
         --phones-dict=dump/phone_id_map.txt \
         --speaker-dict=dump/speaker_id_map.txt
@@ -51,8 +77,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
         --metadata=dump/dev/raw/metadata.jsonl \
         --dumpdir=dump/dev/norm \
         --speech-stats=dump/train/speech_stats.npy \
-        --pitch-stats=dump/train/pitch_stats.npy \
-        --energy-stats=dump/train/energy_stats.npy \
         --phones-dict=dump/phone_id_map.txt \
         --speaker-dict=dump/speaker_id_map.txt
@@ -60,8 +84,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
         --metadata=dump/test/raw/metadata.jsonl \
         --dumpdir=dump/test/norm \
         --speech-stats=dump/train/speech_stats.npy \
-        --pitch-stats=dump/train/pitch_stats.npy \
-        --energy-stats=dump/train/energy_stats.npy \
         --phones-dict=dump/phone_id_map.txt \
         --speaker-dict=dump/speaker_id_map.txt
 fi
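The preprocessing script grows from four stages to seven, each guarded by the same `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]` pattern. For readers unfamiliar with the idiom, a hypothetical sketch of the gating logic in Python (the function name is mine, not the repo's):

```python
def stages_to_run(stage: int, stop_stage: int, n_stages: int = 7) -> list:
    """Return the stage indices a Kaldi-style script would execute:
    every stage N satisfying stage <= N <= stop_stage."""
    return [n for n in range(n_stages) if stage <= n <= stop_stage]
```

For example, `stage=0 stop_stage=6` runs the whole pipeline, while `stage=2 stop_stage=2` only concatenates the two durations files.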
examples/aishell3_vctk/ernie_sat/path.sh

 #!/bin/bash
-export MAIN_ROOT=`realpath ${PWD}/../../`
+export MAIN_ROOT=`realpath ${PWD}/../../../`
 export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
 export LC_ALL=C
examples/vctk/ernie_sat/local/synthesize.sh

#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

stage=1
stop_stage=1

# use am to predict duration here
# add am_phones_dict, am_tones_dict, etc.; the am could also be built the new way, which no longer needs so many arguments

# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize.py \
        --erniesat_config=${config_path} \
        --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --erniesat_stat=dump/train/speech_stats.npy \
        --voc=pwgan_vctk \
        --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
        --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
        --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --phones_dict=dump/phone_id_map.txt
fi

# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize.py \
        --erniesat_config=${config_path} \
        --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --erniesat_stat=dump/train/speech_stats.npy \
        --voc=hifigan_vctk \
        --voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \
        --voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \
        --voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --phones_dict=dump/phone_id_map.txt
fi
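The two blocks above differ only in their vocoder flags. A sketch of how those flags could be assembled per preset (the preset names, directories, and snapshot files come from the script; the helper itself is hypothetical):

```python
# checkpoint layouts as used in synthesize.sh
VOC_PRESETS = {
    "pwgan_vctk": ("pwg_vctk_ckpt_0.1.1", "snapshot_iter_1500000.pdz"),
    "hifigan_vctk": ("hifigan_vctk_ckpt_0.2.0", "snapshot_iter_2500000.pdz"),
}


def voc_flags(voc: str) -> list:
    """Build the --voc* arguments for synthesize.py from a preset name."""
    ckpt_dir, snapshot = VOC_PRESETS[voc]
    return [
        f"--voc={voc}",
        f"--voc_config={ckpt_dir}/default.yaml",
        f"--voc_ckpt={ckpt_dir}/{snapshot}",
        f"--voc_stat={ckpt_dir}/feats_stats.npy",
    ]
```

Every preset follows the same layout (`default.yaml`, a `snapshot_iter_*.pdz`, and `feats_stats.npy` in one directory), which is what makes the two shell blocks near-duplicates.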
paddlespeech/t2s/datasets/am_batch_fn.py

@@ -119,9 +119,17 @@ def erniesat_batch_fn(examples,
     speech_mask = make_non_pad_mask(
         speech_lengths, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2)

+    # for training
+    span_bdy = None
+    # for inference
+    if 'span_bdy' in examples[0].keys():
+        span_bdy = [
+            np.array(item["span_bdy"], dtype=np.int64) for item in examples
+        ]
+        span_bdy = paddle.to_tensor(span_bdy)
+
     # dual_mask masks both speech and text for mixed Chinese-English input
     # ernie sat masks both when doing cross-lingual synthesis
-    span_bdy = None
     if text_masking:
         masked_pos, text_masked_pos = phones_text_masking(
             xs_pad=speech_pad,
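The new branch makes the collate function dual-purpose: training batches carry no `span_bdy`, so it stays `None` and the masking function samples its own spans, while inference examples supply explicit boundaries. A minimal stand-alone sketch of that dispatch, using plain lists instead of paddle tensors:

```python
def collect_span_bdy(examples: list):
    """Return per-example [left, right] mask boundaries, or None when the
    examples carry no 'span_bdy' key (i.e. a training batch)."""
    if 'span_bdy' not in examples[0]:
        return None  # training: boundaries are sampled by the masking fn
    return [[int(b) for b in item["span_bdy"]] for item in examples]
```

Keying the behavior off the presence of a field lets one collate function serve both the trainer and the new synthesize script without an extra flag.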
paddlespeech/t2s/exps/ernie_sat/preprocess.py

@@ -166,7 +166,8 @@ def process_sentences(config,
                 results.append(record)
     results.sort(key=itemgetter("utt_id"))
-    with jsonlines.open(output_dir / "metadata.jsonl", 'w') as writer:
+    # replace 'w' with 'a' to write from the end of file
+    with jsonlines.open(output_dir / "metadata.jsonl", 'a') as writer:
         for item in results:
             writer.write(item)
     print("Done")
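Switching `jsonlines.open` from `'w'` to `'a'` is what lets the aishell3 and vctk preprocessing passes accumulate into a single `metadata.jsonl` instead of the second pass clobbering the first. A stdlib-only sketch of the same append behavior, using `json` directly rather than the `jsonlines` package (the utterance ids below are made-up placeholders):

```python
import json
import tempfile
from pathlib import Path


def append_metadata(path: Path, records: list) -> None:
    # mode 'a' writes from the end of file, so successive dataset
    # passes accumulate instead of overwriting
    with open(path, 'a') as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# two passes, as in the aishell3 + vctk pipeline
out = Path(tempfile.mkdtemp()) / "metadata.jsonl"
append_metadata(out, [{"utt_id": "SSB0005_0001"}])
append_metadata(out, [{"utt_id": "p225_001"}])
lines = out.read_text().splitlines()
```

The trade-off is that re-running preprocessing without deleting the old `metadata.jsonl` now duplicates entries, so the dump directory must be cleaned between full runs.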
paddlespeech/t2s/exps/ernie_sat/synthesize.py (new file)

# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn
from paddlespeech.t2s.exps.syn_utils import denorm
from paddlespeech.t2s.exps.syn_utils import get_am_inference
from paddlespeech.t2s.exps.syn_utils import get_test_dataset
from paddlespeech.t2s.exps.syn_utils import get_voc_inference


def evaluate(args):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)

    # Init body.
    with open(args.erniesat_config) as f:
        erniesat_config = CfgNode(yaml.safe_load(f))
    with open(args.voc_config) as f:
        voc_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(erniesat_config)
    print(voc_config)

    # ernie sat model
    erniesat_inference = get_am_inference(
        am='erniesat_dataset',
        am_config=erniesat_config,
        am_ckpt=args.erniesat_ckpt,
        am_stat=args.erniesat_stat,
        phones_dict=args.phones_dict)

    test_dataset = get_test_dataset(
        test_metadata=test_metadata, am='erniesat_dataset')

    # vocoder
    voc_inference = get_voc_inference(
        voc=args.voc,
        voc_config=voc_config,
        voc_ckpt=args.voc_ckpt,
        voc_stat=args.voc_stat)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    collate_fn = build_erniesat_collate_fn(
        mlm_prob=erniesat_config.mlm_prob,
        mean_phn_span=erniesat_config.mean_phn_span,
        seg_emb=erniesat_config.model['enc_input_layer'] == 'sega_mlm',
        text_masking=False,
        epoch=-1)

    gen_raw = True
    erniesat_mu, erniesat_std = np.load(args.erniesat_stat)

    for datum in test_dataset:
        # collate function and dataloader
        utt_id = datum["utt_id"]
        speech_len = datum["speech_lengths"]

        # mask the middle 1/3 speech
        left_bdy, right_bdy = speech_len // 3, 2 * speech_len // 3
        span_bdy = [left_bdy, right_bdy]
        datum.update({"span_bdy": span_bdy})

        batch = collate_fn([datum])
        with paddle.no_grad():
            out_mels = erniesat_inference(
                speech=batch["speech"],
                text=batch["text"],
                masked_pos=batch["masked_pos"],
                speech_mask=batch["speech_mask"],
                text_mask=batch["text_mask"],
                speech_seg_pos=batch["speech_seg_pos"],
                text_seg_pos=batch["text_seg_pos"],
                span_bdy=span_bdy)

            # vocoder
            wav_list = []
            for mel in out_mels:
                part_wav = voc_inference(mel)
                wav_list.append(part_wav)
            wav = paddle.concat(wav_list)
            wav = wav.numpy()
            if gen_raw:
                speech = datum['speech']
                denorm_mel = denorm(speech, erniesat_mu, erniesat_std)
                denorm_mel = paddle.to_tensor(denorm_mel)
                wav_raw = voc_inference(denorm_mel)
                wav_raw = wav_raw.numpy()

        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav,
            samplerate=erniesat_config.fs)
        if gen_raw:
            sf.write(
                str(output_dir / (utt_id + "_raw" + ".wav")),
                wav_raw,
                samplerate=erniesat_config.fs)
        print(f"{utt_id} done!")


def parse_args():
    # parse args and config
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # ernie sat
    parser.add_argument(
        '--erniesat_config',
        type=str,
        default=None,
        help='Config of acoustic model.')
    parser.add_argument(
        '--erniesat_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--erniesat_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=[
            'pwgan_aishell3',
            'pwgan_vctk',
            'hifigan_aishell3',
            'hifigan_vctk',
        ],
        help='Choose vocoder type of tts task.')
    parser.add_argument(
        '--voc_config', type=str, default=None, help='Config of voc.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    # other
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--test_metadata", type=str, help="test metadata.")
    parser.add_argument("--output_dir", type=str, help="output dir.")

    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")
    evaluate(args)


if __name__ == "__main__":
    main()
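The evaluation loop always masks the middle third of each utterance via `left_bdy, right_bdy = speech_len // 3, 2 * speech_len // 3`, then lets the model regenerate the frames between the two boundaries. A tiny sketch of that boundary computation:

```python
def middle_third_span(speech_len: int) -> list:
    """Boundaries that mask the middle 1/3 of a mel sequence, as in the
    evaluate() loop of synthesize.py."""
    left_bdy, right_bdy = speech_len // 3, 2 * speech_len // 3
    return [left_bdy, right_bdy]
```

Because both boundaries use floor division, the masked span is always at least a third of the utterance minus one frame, regardless of whether `speech_len` divides by three.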
paddlespeech/t2s/exps/ernie_sat/train.py

@@ -62,8 +62,6 @@ def train_sp(args, config):
         "align_end"
     ]
     converters = {"speech": np.load}
-    spk_num = None
-
     # dataloader has been too verbose
     logging.getLogger("DataLoader").disabled = True
paddlespeech/t2s/exps/syn_utils.py

@@ -68,6 +68,10 @@ model_alias = {
     "paddlespeech.t2s.models.wavernn:WaveRNN",
     "wavernn_inference":
     "paddlespeech.t2s.models.wavernn:WaveRNNInference",
+    "erniesat":
+    "paddlespeech.t2s.models.ernie_sat:ErnieSAT",
+    "erniesat_inference":
+    "paddlespeech.t2s.models.ernie_sat:ErnieSATInference",
 }
@@ -109,6 +113,7 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
     # model: {model_name}_{dataset}
     am_name = am[:am.rindex('_')]
     am_dataset = am[am.rindex('_') + 1:]
+    converters = {}
     if am_name == 'fastspeech2':
         fields = ["utt_id", "text"]
         if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None:
@@ -126,8 +131,17 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
         if voice_cloning:
             print("voice cloning!")
             fields += ["spk_emb"]
+    elif am_name == 'erniesat':
+        fields = [
+            "utt_id", "text", "text_lengths", "speech", "speech_lengths",
+            "align_start", "align_end"
+        ]
+        converters = {"speech": np.load}
+    else:
+        print("wrong am, please input right am!!!")

-    test_dataset = DataTable(data=test_metadata, fields=fields)
+    test_dataset = DataTable(
+        data=test_metadata, fields=fields, converters=converters)
     return test_dataset
@@ -193,6 +207,10 @@ def get_am_inference(am: str='fastspeech2_csmsc',
             **am_config["model"])
     elif am_name == 'tacotron2':
         am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
+    elif am_name == 'erniesat':
+        am = am_class(idim=vocab_size, odim=odim, **am_config["model"])
+    else:
+        print("wrong am, please input right am!!!")

     am.set_state_dict(paddle.load(am_ckpt)["main_params"])
     am.eval()
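`get_test_dataset` and `get_am_inference` both split the `am` string at its last underscore, which is why the new synthesize script passes `am='erniesat_dataset'`: everything before the last `_` selects the model class, everything after names the dataset. A sketch of that naming convention:

```python
def split_am(am: str) -> tuple:
    """Split a '{model_name}_{dataset}' string at its last underscore,
    mirroring am[:am.rindex('_')] / am[am.rindex('_') + 1:] in syn_utils.py."""
    idx = am.rindex('_')
    return am[:idx], am[idx + 1:]
```

Using `rindex` rather than the first underscore is what keeps model names that themselves contain underscores parsing correctly.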
paddlespeech/t2s/models/ernie_sat/ernie_sat.py

@@ -389,7 +389,7 @@ class MLM(nn.Layer):
             speech_seg_pos: paddle.Tensor,
             text_seg_pos: paddle.Tensor,
             span_bdy: List[int],
-            use_teacher_forcing: bool=False, ) -> Dict[str, paddle.Tensor]:
+            use_teacher_forcing: bool=False, ) -> List[paddle.Tensor]:
         '''
         Args:
             speech (paddle.Tensor): input speech (1, Tmax, D).
@@ -668,3 +668,38 @@ class ErnieSAT(nn.Layer):
             text_seg_pos=text_seg_pos,
             span_bdy=span_bdy,
             use_teacher_forcing=use_teacher_forcing)
+
+
+class ErnieSATInference(nn.Layer):
+    def __init__(self, normalizer, model):
+        super().__init__()
+        self.normalizer = normalizer
+        self.acoustic_model = model
+
+    def forward(self,
+                speech: paddle.Tensor,
+                text: paddle.Tensor,
+                masked_pos: paddle.Tensor,
+                speech_mask: paddle.Tensor,
+                text_mask: paddle.Tensor,
+                speech_seg_pos: paddle.Tensor,
+                text_seg_pos: paddle.Tensor,
+                span_bdy: List[int],
+                use_teacher_forcing: bool=True):
+        outs = self.acoustic_model.inference(
+            speech=speech,
+            text=text,
+            masked_pos=masked_pos,
+            speech_mask=speech_mask,
+            text_mask=text_mask,
+            speech_seg_pos=speech_seg_pos,
+            text_seg_pos=text_seg_pos,
+            span_bdy=span_bdy,
+            use_teacher_forcing=use_teacher_forcing)
+        normed_mel_pre, normed_mel_masked, normed_mel_post = outs
+        logmel_pre = self.normalizer.inverse(normed_mel_pre)
+        logmel_masked = self.normalizer.inverse(normed_mel_masked)
+        logmel_post = self.normalizer.inverse(normed_mel_post)
+        return logmel_pre, logmel_masked, logmel_post
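`ErnieSATInference` pairs the model with a normalizer whose `inverse` maps the network's normalized mels back to log-mel scale before vocoding. Assuming the normalizer is the usual z-score transform over the training statistics (the diff does not show its definition), the inverse is simply `x * std + mu`; a minimal sketch:

```python
def zscore_inverse(normed: list, mu: float, std: float) -> list:
    """Undo z-score normalization: x_norm = (x - mu) / std  =>  x = x_norm * std + mu."""
    return [v * std + mu for v in normed]
```

This is also what the `denorm(speech, erniesat_mu, erniesat_std)` call in the new synthesize script relies on when producing the `_raw` reference wavs.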