PaddlePaddle / DeepSpeech
Commit 4c3c5546
Authored May 25, 2021 by chenfeiyu
Parent: b373a254

1. use space as separator;
2. add docstring for some functions.
Showing 3 changed files with 102 additions and 2 deletions:

examples/chinese_g2p/local/convert_transcription.py (+1, -1)
examples/chinese_g2p/local/extract_pinyin_label.py (+1, -1)
examples/chinese_g2p/local/ignore_sandhi.py (+100, -0)
examples/chinese_g2p/local/convert_transcription.py

```diff
@@ -34,7 +34,7 @@ def extract_pinyin(source, target, use_jieba=False):
                 style=Style.TONE3,
                 neutral_tone_with_five=True)
             transcription = ' '.join(syllables)
-            fout.write(f'{sentence_id}\t{transcription}\n')
+            fout.write(f'{sentence_id} {transcription}\n')
         else:
             continue
```
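The tab-to-space change matters because the commit's new `ignore_sandhi.py` splits each transcription line on a single space with `split(' ', 1)`. A quick standalone illustration, using a made-up sentence id:

```python
# new format after this commit: sentence id and transcription separated by a space
line = "000001 lao3 hu3"  # hypothetical line; "000001" is a made-up sentence id
sentence_id, transcription = line.strip().split(' ', 1)
print(sentence_id)    # 000001
print(transcription)  # lao3 hu3
```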
examples/chinese_g2p/local/extract_pinyin_label.py

```diff
@@ -21,7 +21,7 @@ def extract_pinyin_lables(source, target):
     for i, line in enumerate(fin):
         if i % 2 == 0:
             sentence_id, raw_text = line.strip().split()
-            fout.write(f'{sentence_id}\t')
+            fout.write(f'{sentence_id} ')
         else:
             transcription = line.strip()
             fout.write(f'{transcription}\n')
```
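The label file this script consumes alternates a text line with a pinyin line. A minimal standalone sketch of the loop above, run on hypothetical in-memory data:

```python
import io

# hypothetical two-line label entry: even lines are "id<space>text", odd lines are pinyin
raw = "000001 卡尔普陪外孙玩滑梯\nka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1\n"
fin = io.StringIO(raw)
fout = io.StringIO()
for i, line in enumerate(fin):
    if i % 2 == 0:
        sentence_id, raw_text = line.strip().split()
        fout.write(f'{sentence_id} ')  # space separator, as in the new version
    else:
        transcription = line.strip()
        fout.write(f'{transcription}\n')
print(fout.getvalue())  # 000001 ka3 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1
```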
examples/chinese_g2p/local/ignore_sandhi.py (new file, mode 100644)

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
from typing import List, Union


def erized(syllable: str) -> bool:
    """Whether the syllable contains erhua effect.

    Example
    --------
    huar -> True
    guanr -> True
    er -> False
    """
    # note: for pinyin, len(syllable) >= 2 is always true
    # if not, there is something wrong in the data
    assert len(syllable) >= 2, f"invalid syllable {syllable}"
    return syllable[:2] != "er" and syllable[-2] == 'r'


def ignore_sandhi(reference: List[str], generated: List[str]) -> List[str]:
    """Given a sequence of syllables from human annotation (reference),
    which makes sandhi explicit, and a sequence of syllables from some
    simple g2p program (generated), which does not consider sandhi,
    return the reference sequence with sandhi ignored.

    Example
    --------
    ['lao2', 'hu3'], ['lao3', 'hu3'] -> ['lao3', 'hu3']
    """
    i = 0
    j = 0
    # sandhi is ignored in the result, while other errors are not
    result = []
    while i < len(reference):
        if erized(reference[i]):
            result.append(reference[i])
            i += 1
            j += 2
        elif (reference[i][:-1] == generated[i][:-1]
              and reference[i][-1] == '2' and generated[i][-1] == '3'):
            result.append(generated[i])
            i += 1
            j += 1
        else:
            result.append(reference[i])
            i += 1
            j += 1
    assert j == len(generated), (
        "length of transcriptions mismatch; there may be some characters "
        "that are ignored in the generated transcription.")
    return result


def convert_transcriptions(reference: Union[str, Path],
                           generated: Union[str, Path],
                           output: Union[str, Path]):
    with open(reference, 'rt') as f_ref:
        with open(generated, 'rt') as f_gen:
            with open(output, 'wt') as f_out:
                for i, (ref, gen) in enumerate(zip(f_ref, f_gen)):
                    sentence_id, ref_transcription = ref.strip().split(' ', 1)
                    _, gen_transcription = gen.strip().split(' ', 1)
                    try:
                        result = ignore_sandhi(ref_transcription.split(),
                                               gen_transcription.split())
                        result = ' '.join(result)
                    except Exception:
                        print(f"sentence_id: {sentence_id} "
                              "There is some annotation error in the reference "
                              "or generated transcription. Use the reference.")
                        result = ref_transcription
                    f_out.write(f"{sentence_id} {result}\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="reference transcription but ignore sandhi.")
    parser.add_argument(
        "--reference",
        type=str,
        help="path to the reference transcription of baker dataset.")
    parser.add_argument(
        "--generated", type=str, help="path to the generated transcription.")
    parser.add_argument("--output", type=str, help="path to save result.")
    args = parser.parse_args()
    convert_transcriptions(args.reference, args.generated, args.output)
```
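To see the sandhi-ignoring behavior end to end, here is a standalone check. The two function bodies are copied from the file above so the snippet runs on its own; the input lists come from the function's docstring example:

```python
from typing import List

def erized(syllable: str) -> bool:
    # copied from ignore_sandhi.py: erhua iff the syllable is not "er..." but
    # its second-to-last character (before the tone digit) is 'r'
    assert len(syllable) >= 2, f"invalid syllable {syllable}"
    return syllable[:2] != "er" and syllable[-2] == 'r'

def ignore_sandhi(reference: List[str], generated: List[str]) -> List[str]:
    # copied from ignore_sandhi.py: prefer the generated (pre-sandhi) tone when
    # the reference shows a 2 where the generated shows a 3 (third-tone sandhi)
    i = j = 0
    result = []
    while i < len(reference):
        if erized(reference[i]):
            result.append(reference[i])
            i += 1
            j += 2
        elif (reference[i][:-1] == generated[i][:-1]
              and reference[i][-1] == '2' and generated[i][-1] == '3'):
            result.append(generated[i])
            i += 1
            j += 1
        else:
            result.append(reference[i])
            i += 1
            j += 1
    assert j == len(generated)
    return result

print(ignore_sandhi(['lao2', 'hu3'], ['lao3', 'hu3']))  # ['lao3', 'hu3']
print(erized('huar4'), erized('er2'))                   # True False
```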
登录