PaddlePaddle
DeepSpeech
Commit c2e6378a
Authored Aug 09, 2017 by yangyaming
Simplify codes and comments.
Parent: 1325cd9b

Showing 2 changed files with 17 additions and 18 deletions

tools/_init_paths.py    +3 -0
tools/build_vocab.py    +14 -18

tools/_init_paths.py @ c2e6378a
 """Set up paths for DS2"""
 from __future__ import absolute_import
 from __future__ import division
 from __future__ import print_function

 import os.path
 import sys
 ...

tools/build_vocab.py @ c2e6378a
-"""Build vocabulary dictionary from manifest files.
+"""Build vocabulary from manifest files.

 Each item in vocabulary file is a character.
 """
 ...
@@ -11,13 +11,14 @@ import codecs
 import json
 from collections import Counter
 import os.path
+import _init_paths
+from data_utils import utils

-parser = argparse.ArgumentParser(
-    description='Build vocabulary dictionary from transcription texts.')
+parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
     "--manifest_paths",
     type=str,
-    help="Manifest paths for building vocabulary dictionary."
+    help="Manifest paths for building vocabulary."
     "You can provide multiple manifest files.",
     nargs='+',
     required=True)
 ...
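
Review note: the hunk above switches the parser description to `__doc__` and keeps `nargs='+'` for `--manifest_paths`. The following is a minimal sketch of how those options behave, not the script itself; `build_parser` is a hypothetical helper introduced only for illustration. Note that adjacent string literals are concatenated with no separator, so the help text in the diff renders as "vocabulary.You can provide ..."; the sketch adds a trailing space, which is an assumption about the intended wording.

```python
"""Toy parser mirroring the options touched in this diff."""
import argparse


def build_parser():
    # description=__doc__ reuses the module docstring as the help banner.
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--manifest_paths",
        type=str,
        nargs='+',  # one or more paths, collected into a list
        required=True,
        # Trailing space added here (assumption); the diff omits it.
        help="Manifest paths for building vocabulary. "
        "You can provide multiple manifest files.")
    parser.add_argument(
        "--count_threshold",
        default=0,
        type=int,
        # %(default)i formats the integer default into the help text.
        help="Characters whose counts are below the threshold will be "
        "truncated. (default: %(default)i)")
    return parser


parser = build_parser()
args = parser.parse_args(["--manifest_paths", "a.json", "b.json"])
help_text = " ".join(parser.format_help().split())
```

With this input, `args.manifest_paths` is a list of the two paths and `args.count_threshold` falls back to its default.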
@@ -25,25 +26,20 @@ parser.add_argument(
     "--count_threshold",
     default=0,
     type=int,
-    help="Characters whose count below the threshold will be truncated. "
-    "(default: %(default)s)")
+    help="Characters whose counts are below the threshold will be truncated. "
+    "(default: %(default)i)")
 parser.add_argument(
     "--vocab_path",
     default='datasets/vocab/zh_vocab.txt',
     type=str,
-    help="File path to write vocabularies. (default: %(default)s)")
+    help="File path to write the vocabulary. (default: %(default)s)")
 args = parser.parse_args()


 def count_manifest(counter, manifest_path):
-    for json_line in codecs.open(manifest_path, 'r', 'utf-8'):
-        try:
-            json_data = json.loads(json_line)
-        except Exception as e:
-            raise Exception('Error parsing manifest: %s, %s' % \
-                            (manifest_path, e))
-        text = json_data['text']
-        for char in text:
+    manifest_jsons = utils.read_manifest(manifest_path)
+    for line_json in manifest_jsons:
+        for char in line_json['text']:
             counter.update(char)
 ...
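
Review note: the hunk above moves manifest parsing into `utils.read_manifest`, whose body is not part of this diff. The sketch below shows the same character-counting idea in a self-contained form. `read_manifest_sketch` is a hypothetical stand-in that assumes one JSON object per line with a `text` field, as the removed code did; it is not the repository's implementation.

```python
import json
import os
import tempfile
from collections import Counter


def read_manifest_sketch(manifest_path):
    """Hypothetical stand-in for utils.read_manifest (one JSON object per line)."""
    entries = []
    with open(manifest_path, 'r', encoding='utf-8') as fin:
        for line in fin:
            try:
                entries.append(json.loads(line))
            except ValueError as e:
                raise IOError('Error parsing manifest: %s, %s' %
                              (manifest_path, e))
    return entries


def count_manifest(counter, manifest_path):
    for line_json in read_manifest_sketch(manifest_path):
        # Counter.update on a string counts each character, which matches
        # the per-character loop in the diff.
        counter.update(line_json['text'])


# Usage: write a two-line toy manifest, then count its characters.
with tempfile.NamedTemporaryFile(
        'w', suffix='.json', delete=False, encoding='utf-8') as tmp:
    tmp.write(json.dumps({'text': 'ab a'}) + '\n')
    tmp.write(json.dumps({'text': '你好'}) + '\n')
    toy_manifest_path = tmp.name

counter = Counter()
count_manifest(counter, toy_manifest_path)
os.remove(toy_manifest_path)
```

Spaces are counted like any other character here, so whether they belong in the vocabulary depends on the transcripts.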
@@ -54,9 +50,9 @@ def main():
     count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
     with codecs.open(args.vocab_path, 'w', 'utf-8') as fout:
-        for item_pair in count_sorted:
-            if item_pair[1] < args.count_threshold: break
-            fout.write(item_pair[0] + '\n')
+        for char, count in count_sorted:
+            if count < args.count_threshold: break
+            fout.write(char + '\n')


 if __name__ == '__main__':
 ...
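
Review note: the final hunk replaces index access (`item_pair[0]`, `item_pair[1]`) with tuple unpacking. A minimal sketch of that write loop, isolated from argument parsing, is shown below; `write_vocab` and the in-memory buffer are hypothetical names used only for illustration.

```python
import io
from collections import Counter


def write_vocab(counter, fout, count_threshold=0):
    """Write one character per line, most frequent first."""
    count_sorted = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    for char, count in count_sorted:
        # The list is sorted by descending count, so the first character
        # below the threshold means every later one is below it too.
        if count < count_threshold:
            break
        fout.write(char + '\n')


buf = io.StringIO()
write_vocab(Counter("aaabbc"), buf, count_threshold=2)
vocab = buf.getvalue().splitlines()
```

Here `vocab` holds `a` and `b`; `c` occurs once and falls below the threshold. The early `break` is only valid because the items are sorted by count first.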