PaddlePaddle / DeepSpeech
Commit 8f3280af
Authored Nov 29, 2021 by Junkun
fix data process
Parent: cdd08451
Showing 5 changed files with 183 additions and 24 deletions (+183 -24)
examples/ted_en_zh/st1/local/data.sh (+29 -24)
examples/ted_en_zh/st1/local/data_prep.sh (+54 -0)
examples/ted_en_zh/st1/local/divide_lang.sh (+48 -0)
examples/ted_en_zh/st1/local/espnet_json_to_manifest.py (+27 -0)
examples/ted_en_zh/st1/local/remove_punctuation.pl (+25 -0)
examples/ted_en_zh/st1/local/data.sh
@@ -2,7 +2,7 @@
 set -e
-stage=1
+stage=3
 stop_stage=100
 dict_dir=data/lang_char
@@ -14,6 +14,7 @@ data_dir=./TED_EnZh
 target_dir=data/ted_en_zh
 dumpdir=data/dump
 do_delta=false
+nj=20
 source ${MAIN_ROOT}/utils/parse_options.sh
@@ -40,11 +41,11 @@ if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
         exit 1
     fi
-    # # extract data
-    # echo "data Extraction"
-    # python3 local/ted_en_zh.py \
-    #     --tgt-dir=${target_dir} \
-    #     --src-dir=${data_dir}
+    # extract data
+    echo "data Extraction"
+    python3 local/ted_en_zh.py \
+        --tgt-dir=${target_dir} \
+        --src-dir=${data_dir}
 fi
 prep_dir=${target_dir}/data_prep
@@ -99,7 +100,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
     done
 fi
-feat_tr_dir=${dumpdir}/train/delta${do_delta}; mkdir -p ${feat_tr_dir}
+feat_tr_dir=${dumpdir}/train_sp/delta${do_delta}; mkdir -p ${feat_tr_dir}
 feat_dt_dir=${dumpdir}/dev/delta${do_delta}; mkdir -p ${feat_dt_dir}
 feat_trans_dir=${dumpdir}/test/delta${do_delta}; mkdir -p ${feat_trans_dir}
 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
@@ -109,7 +110,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
     fbankdir=data/fbank
     # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame
     for x in train dev test; do
-        steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 32 --write_utt2num_frames true \
+        steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj ${nj} --write_utt2num_frames true \
             ${prep_dir}/${x}.en-zh data/make_fbank/${x} ${fbankdir}
     done
@@ -123,7 +124,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
     rm -r ${prep_dir}/temp*.en-zh
     utils/fix_data_dir.sh ${prep_dir}/train_sp.en-zh
-    steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 32 --write_utt2num_frames true \
+    steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj ${nj} --write_utt2num_frames true \
         ${prep_dir}/train_sp.en-zh exp/make_fbank/train_sp.en-zh ${fbankdir}
     for lang in en zh; do
@@ -155,14 +156,14 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
         rm -rf ${prep_dir}/${x}.en-zh.*.tmp
     done
-    compute-cmvn-stats scp:${prep_dir}/train_sp.en-zh/feats.scp ${prep_dir}/train_sp.en-zh/cmvn.ark
+    compute-cmvn-stats scp:${prep_dir}/train_sp.en-zh.zh/feats.scp ${prep_dir}/train_sp.en-zh.zh/cmvn.ark
-    dump.sh --cmd "$train_cmd" --nj 80 --do_delta $do_delta \
-        ${prep_dir}/train_sp.en-zh/feats.scp ${prep_dir}/train_sp.en-zh/cmvn.ark ${prep_dir}/dump_feats/train_sp.en-zh ${feat_tr_dir}
-    dump.sh --cmd "$train_cmd" --nj 32 --do_delta $do_delta \
-        ${prep_dir}/dev.en-zh/feats.scp ${prep_dir}/train_sp.en-zh/cmvn.ark ${prep_dir}/dump_feats/dev.en-zh ${feat_dt_dir}
-    dump.sh --cmd "$train_cmd" --nj 32 --do_delta $do_delta \
-        ${prep_dir}/test.en-zh/feats.scp ${prep_dir}/train_sp.en-zh/cmvn.ark ${prep_dir}/dump_feats/test.en-zh ${feat_trans_dir}
+    dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta $do_delta \
+        ${prep_dir}/train_sp.en-zh.zh/feats.scp ${prep_dir}/train_sp.en-zh.zh/cmvn.ark ${prep_dir}/dump_feats/train_sp.en-zh.zh ${feat_tr_dir}
+    dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta $do_delta \
+        ${prep_dir}/dev.en-zh.zh/feats.scp ${prep_dir}/train_sp.en-zh.zh/cmvn.ark ${prep_dir}/dump_feats/dev.en-zh.zh ${feat_dt_dir}
+    dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta $do_delta \
+        ${prep_dir}/test.en-zh.zh/feats.scp ${prep_dir}/train_sp.en-zh.zh/cmvn.ark ${prep_dir}/dump_feats/test.en-zh.zh ${feat_trans_dir}
 fi
 dict=${dict_dir}/ted_en_zh_${bpemode}${nbpe}_joint.txt
@@ -170,9 +171,6 @@ nlsyms=${dict_dir}/ted_en_zh_non_lang_syms.txt
 bpemodel=${dict_dir}/ted_en_zh_${bpemode}${nbpe}
 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     echo "stage 2: Dictionary and Json Data Preparation"
-    # echo "make a non-linguistic symbol list for all languages"
-    # grep sp1.0 ${prep_dir}/train_sp.en-zh.*/text | cut -f 2- -d' ' | grep -o -P '&[^;];'| sort | uniq > ${nlsyms}
-    # cat ${nlsyms}
     echo "make a joint source and target dictionary"
     echo "<unk> 1" > ${dict} # <unk> must be 1, 0 will be used for "blank" in CTC
@@ -183,20 +181,27 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     wc -l ${dict}
     echo "make json files"
-    data2json.sh --nj 16 --feat ${feat_tr_dir}/feats.scp --text ${prep_dir}/train_sp.en-zh.zh/text --bpecode ${bpemodel}.model --lang zh \
+    data2json.sh --nj ${nj} --feat ${feat_tr_dir}/feats.scp --text ${prep_dir}/train_sp.en-zh.zh/text --bpecode ${bpemodel}.model --lang zh \
         ${prep_dir}/train_sp.en-zh.zh ${dict} > ${feat_tr_dir}/data_${bpemode}${nbpe}.json
     data2json.sh --feat ${feat_dt_dir}/feats.scp --text ${prep_dir}/dev.en-zh.zh/text --bpecode ${bpemodel}.model --lang zh \
         ${prep_dir}/dev.en-zh.zh ${dict} > ${feat_dt_dir}/data_${bpemode}${nbpe}.json
-    data2json.sh --feat ${feat_dt_dir}/feats.scp --text ${prep_dir}/test.en-zh.zh/text --bpecode ${bpemodel}.model --lang zh \
+    data2json.sh --feat ${feat_trans_dir}/feats.scp --text ${prep_dir}/test.en-zh.zh/text --bpecode ${bpemodel}.model --lang zh \
         ${prep_dir}/test.en-zh.zh ${dict} > ${feat_trans_dir}/data_${bpemode}${nbpe}.json
     echo "update json (add source references)"
     # update json (add source references)
-    for x in ${train_set} ${train_dev}; do
+    for x in train_sp dev; do
         feat_dir=${dumpdir}/${x}/delta${do_delta}
-        data_dir=data/$(echo ${x} | cut -f 1 -d ".").en-zh.en
-        update_json.sh --text ${data_dir}/text.${src_case} --bpecode ${bpemodel}.model \
+        data_dir=${prep_dir}/$(echo ${x} | cut -f 1 -d ".").en-zh.en
+        update_json.sh --text ${data_dir}/text --bpecode ${bpemodel}.model \
             ${feat_dir}/data_${bpemode}${nbpe}.json ${data_dir} ${dict}
     done
 fi
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+    echo "stage 3: Format the Json Data"
+    python3 local/espnet_json_to_manifest.py --json-file ${feat_tr_dir}/data_${bpemode}${nbpe}.json --manifest-file data/manifest.train
+    python3 local/espnet_json_to_manifest.py --json-file ${feat_dt_dir}/data_${bpemode}${nbpe}.json --manifest-file data/manifest.dev
+    python3 local/espnet_json_to_manifest.py --json-file ${feat_trans_dir}/data_${bpemode}${nbpe}.json --manifest-file data/manifest.test
+fi
 echo "Ted En-Zh Data preparation done."
 exit 0
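Each block in data.sh is guarded by a test of the form `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]`, so changing the default `stage` from 1 to 3 changes which blocks run by default. The following is a minimal sketch of that gating logic in Python; the function name `stages_to_run` and the tuple of stage numbers are illustrative assumptions, with the stage numbers taken from the guards visible in this diff.

```python
def stages_to_run(stage, stop_stage, available=(-1, 0, 1, 2, 3)):
    """Return the stage numbers whose guard would pass.

    A block numbered N runs when stage <= N and stop_stage >= N,
    mirroring `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]`.
    """
    return [n for n in available if stage <= n <= stop_stage]


# New default in this commit: only the manifest formatting stage runs.
new_default = stages_to_run(stage=3, stop_stage=100)

# Previous default start stage, evaluated against the same stage list.
old_default = stages_to_run(stage=1, stop_stage=100)

print(new_default)
print(old_default)
```

Because `parse_options.sh` is sourced after the defaults are set, the earlier stages can presumably still be requested from the command line instead of editing the script.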
examples/ted_en_zh/st1/local/data_prep.sh
0 → 100755
#!/bin/bash
# Copyright 2019 Kyoto University (Hirofumi Inaguma)
#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
export LC_ALL=C
data_dir=${1}
for set in train dev test; do
# for set in train; do
    dst=${target_dir}/${set}
    for lang in en zh; do
        if [ ${lang} = 'en' ]; then
            echo "remove punctuation $lang"
            # remove punctuation
            local/remove_punctuation.pl < ${dst}/${lang}.org > ${dst}/${lang}.raw
        else
            cp ${dst}/${lang}.org ${dst}/${lang}.raw
        fi
        paste -d " " ${dst}/.yaml ${dst}/${lang}.raw | sort > ${dst}/text.${lang}
    done
    # error check
    n=$(cat ${dst}/.yaml | wc -l)
    n_en=$(cat ${dst}/en.raw | wc -l)
    n_tgt=$(cat ${dst}/zh.raw | wc -l)
    [ ${n} -ne ${n_en} ] && echo "Warning: expected ${n} data data files, found ${n_en}" && exit 1;
    [ ${n} -ne ${n_tgt} ] && echo "Warning: expected ${n} data data files, found ${n_tgt}" && exit 1;
    echo "done text processing"
    cat ${dst}/wav.scp.org | uniq | sort -k1,1 -u > ${dst}/wav.scp
    cat ${dst}/utt2spk.org | uniq | sort -k1,1 -u > ${dst}/utt2spk
    cat ${dst}/utt2spk | utt2spk_to_spk2utt.pl | sort -k1,1 -u > ${dst}/spk2utt
    rm -rf ${target_dir}/data_prep/${set}.en-zh
    mkdir -p ${target_dir}/data_prep/${set}.en-zh
    echo "remove duplicate lines..."
    cut -d ' ' -f 1 ${dst}/text.en | sort | uniq -c | sort -n -k1 -r | grep -v '1 ted-en-zh' \
        | sed 's/^[ \t]*//' > ${dst}/duplicate_lines
    cut -d ' ' -f 1 ${dst}/text.en | sort | uniq -c | sort -n -k1 -r | grep '1 ted-en-zh' \
        | cut -d '1' -f 2- | sed 's/^[ \t]*//' > ${dst}/reclist
    reduce_data_dir.sh ${dst} ${dst}/reclist ${target_dir}/data_prep/${set}.en-zh
    echo "done wav processing"
    for l in en zh; do
        cp ${dst}/text.${l} ${target_dir}/data_prep/${set}.en-zh/text.${l}
    done
    fix_data_dir.sh --utt_extra_files \
        "text.en text.zh" \
        ${target_dir}/data_prep/${set}.en-zh
done
\ No newline at end of file
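The "remove duplicate lines" step in data_prep.sh appears intended to keep only the utterance IDs that occur exactly once in text.en (written to `reclist`) and to set the rest aside (`duplicate_lines`). A minimal sketch of that idea in Python follows. It counts IDs exactly rather than pattern-matching `uniq -c` output, so it is a restatement of the intent under that assumption, not a line-for-line port; the function name and the sample IDs are made up.

```python
from collections import Counter


def split_unique_and_duplicates(utt_ids):
    """Split utterance IDs into those seen exactly once and those repeated."""
    counts = Counter(utt_ids)
    unique = sorted(u for u, c in counts.items() if c == 1)
    duplicated = sorted(u for u, c in counts.items() if c > 1)
    return unique, duplicated


# Hypothetical IDs standing in for the first column of text.en.
ids = ["ted-en-zh_a", "ted-en-zh_b", "ted-en-zh_a", "ted-en-zh_c"]
reclist, duplicate_ids = split_unique_and_duplicates(ids)

print(reclist)
print(duplicate_ids)
```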
examples/ted_en_zh/st1/local/divide_lang.sh
0 → 100755
#!/bin/bash
# Copyright 2019 Kyoto University (Hirofumi Inaguma)
#           2021 PaddlePaddle
#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)
. ./path.sh
if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <set> <lang>>"
    echo "e.g.: $0 dev"
    exit 1
fi
set=$1
lang=$2
export LC_ALL=en_US.UTF-8
# Copy stuff intoc its final locations [this has been moved from the format_data script]
# for En
mkdir -p ${set}.en
for f in spk2utt utt2spk segments wav.scp feats.scp utt2num_frames; do
    if [ -f ${set}/${f} ]; then
        sort ${set}/${f} > ${set}.en/${f}
    fi
done
sort ${set}/text.en | sed $'s/[^[:print:]]//g' > ${set}.en/text
utils/fix_data_dir.sh ${set}.en
if [ -f ${set}.en/feats.scp ]; then
    utils/validate_data_dir.sh ${set}.en || exit 1;
else
    utils/validate_data_dir.sh --no-feats --no-wav ${set}.en || exit 1;
fi
# for target language
mkdir -p ${set}.${lang}
for f in spk2utt utt2spk segments wav.scp feats.scp utt2num_frames; do
    if [ -f ${set}/${f} ]; then
        sort ${set}/${f} > ${set}.${lang}/${f}
    fi
done
sort ${set}/text.${lang} | sed $'s/[^[:print:]]//g' > ${set}.${lang}/text
utils/fix_data_dir.sh ${set}.${lang}
if [ -f ${set}.${lang}/feats.scp ]; then
    utils/validate_data_dir.sh ${set}.${lang} || exit 1;
else
    utils/validate_data_dir.sh --no-feats --no-wav ${set}.${lang} || exit 1;
fi
examples/ted_en_zh/st1/local/espnet_json_to_manifest.py
0 → 100644
#!/usr/bin/env python
import argparse
import json


def main(args):
    with open(args.json_file, 'r') as fin:
        data_json = json.load(fin)

    with open(args.manifest_file, 'w') as fout:
        for key, value in data_json['utts'].items():
            value['utt'] = key
            fout.write(json.dumps(value, ensure_ascii=False))
            fout.write("\n")


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        '--json-file', type=str, default=None, help="espnet data json file.")
    parser.add_argument(
        '--manifest-file',
        type=str,
        default='manifest.train',
        help='manifest data json line file.')
    args = parser.parse_args()
    main(args)
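To illustrate what the conversion produces, here is a minimal sketch of the same transformation applied to an in-memory dictionary instead of files. Only the top-level `utts` key and the added `utt` field come from the script above; the utterance IDs and the `output`/`text` fields are made-up placeholders, and real ESPnet JSON entries carry more fields.

```python
import json


def utts_to_manifest_lines(data_json):
    """Return one JSON line per utterance, with the ID stored under 'utt'."""
    lines = []
    for key, value in data_json["utts"].items():
        record = dict(value)  # shallow copy so the input stays unchanged
        record["utt"] = key
        # ensure_ascii=False keeps Chinese text readable in the manifest
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Hypothetical two-utterance input.
example = {
    "utts": {
        "utt_0001": {"output": [{"text": "你好"}]},
        "utt_0002": {"output": [{"text": "谢谢"}]},
    }
}

manifest_lines = utts_to_manifest_lines(example)
for line in manifest_lines:
    print(line)
```

The result is a JSON Lines layout: each line is an independent JSON object, so a manifest can be read one utterance at a time without loading the whole file.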
examples/ted_en_zh/st1/local/remove_punctuation.pl
0 → 100755
#!/usr/bin/perl
use warnings;
use strict;
binmode(STDIN,":utf8");
binmode(STDOUT,":utf8");
while(<STDIN>) {
  $_ = " $_ ";
  # remove punctuation except apostrophe
  s/<space>/spacemark/g;  # for scoring
  s/'/apostrophe/g;
  s/[[:punct:]]//g;
  s/apostrophe/'/g;
  s/spacemark/<space>/g;  # for scoring
  # remove whitespace
  s/\s+/ /g;
  s/^\s+//;
  s/\s+$//;
  print "$_\n";
}
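The Perl filter protects `<space>` tokens and apostrophes, strips the remaining punctuation, and then collapses whitespace. A rough Python sketch of the same sequence is shown below. It assumes ASCII punctuation only (`string.punctuation`), whereas the Perl `[[:punct:]]` class on UTF-8 input may match more characters, so the two are not guaranteed to agree on non-ASCII text. The function name is illustrative.

```python
import re
import string

# ASCII punctuation without the apostrophe (assumption: non-ASCII
# punctuation is left untouched by this sketch).
_PUNCT = "".join(ch for ch in string.punctuation if ch != "'")
_PUNCT_RE = re.compile("[" + re.escape(_PUNCT) + "]")


def remove_punctuation(line):
    """Strip punctuation except apostrophes, keeping <space> tokens intact."""
    line = line.replace("<space>", "spacemark")  # protect the token
    line = _PUNCT_RE.sub("", line)               # drop remaining punctuation
    line = line.replace("spacemark", "<space>")  # restore the token
    return re.sub(r"\s+", " ", line).strip()     # collapse and trim whitespace


cleaned = remove_punctuation("Hello, world!  It's <space> fine.")
print(cleaned)
```

Excluding the apostrophe from the character class replaces the Perl script's temporary `apostrophe` placeholder; the `spacemark` placeholder is kept because `<` and `>` would otherwise be removed.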