机器未来 / Paddle
Forked from PaddlePaddle / Paddle (in sync with the upstream project)
Commit 4b5a4322
Authored by Yu Yang on April 10, 2017
Committed by GitHub on April 10, 2017
Parents: 4a99c441, 270c0c5f

Merge pull request #1763 from reyoung/feature/remove_unnecessary_code_in_dataset

Remove unnecessary code to generate freq_dict.

Showing 3 changed files with 9 additions and 15 deletions.
python/paddle/v2/dataset/common.py    +0 -7
python/paddle/v2/dataset/imdb.py      +3 -2
python/paddle/v2/dataset/imikolov.py  +6 -6
--- a/python/paddle/v2/dataset/common.py
+++ b/python/paddle/v2/dataset/common.py
@@ -66,13 +66,6 @@ def download(url, module_name, md5sum):
     return filename
 
 
-def dict_add(a_dict, ele):
-    if ele in a_dict:
-        a_dict[ele] += 1
-    else:
-        a_dict[ele] = 1
-
-
 def fetch_all():
     for module_name in filter(lambda x: not x.startswith("__"),
                               dir(paddle.v2.dataset)):
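The removed helper and its replacement count in the same way. The following is a minimal sketch of that equivalence, assuming Python 3; the `tokens` list is made up for illustration and does not come from the commit.

```python
import collections


def dict_add(a_dict, ele):
    """The helper this commit removes from common.py."""
    if ele in a_dict:
        a_dict[ele] += 1
    else:
        a_dict[ele] = 1


# Hypothetical input, for illustration only.
tokens = ["the", "cat", "the", "hat"]

# Old style: a plain dict plus the explicit membership check.
old_style = {}
for t in tokens:
    dict_add(old_style, t)

# New style: defaultdict(int) supplies 0 for a missing key,
# so `+= 1` needs no branch and no helper function.
new_style = collections.defaultdict(int)
for t in tokens:
    new_style[t] += 1

print(dict(new_style) == old_style)
```

Because the two approaches produce the same counts, the helper can be deleted without changing what the dataset modules compute.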
--- a/python/paddle/v2/dataset/imdb.py
+++ b/python/paddle/v2/dataset/imdb.py
@@ -18,6 +18,7 @@ TODO(yuyang18): Complete comments.
 """
 
 import paddle.v2.dataset.common
+import collections
 import tarfile
 import Queue
 import re
@@ -48,10 +49,10 @@ def tokenize(pattern):
 
 
 def build_dict(pattern, cutoff):
-    word_freq = {}
+    word_freq = collections.defaultdict(int)
     for doc in tokenize(pattern):
         for word in doc:
-            paddle.v2.dataset.common.dict_add(word_freq, word)
+            word_freq[word] += 1
 
     # Not sure if we should prune less-frequent words here.
     word_freq = filter(lambda x: x[1] > cutoff, word_freq.items())
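The counting and pruning steps visible in this hunk can be sketched on their own. This is an illustration under stated assumptions, not the module's code: `count_and_prune` and the sample `docs` are hypothetical names, the input is an in-memory list rather than the output of `tokenize(pattern)`, and the sketch targets Python 3, where `filter` is lazy, so a list comprehension is used instead.

```python
import collections


def count_and_prune(docs, cutoff):
    """Hypothetical stand-in for the counting and pruning steps of build_dict."""
    word_freq = collections.defaultdict(int)
    for doc in docs:
        for word in doc:
            word_freq[word] += 1
    # Keep only words seen strictly more than `cutoff` times, as in the diff.
    return [(w, c) for w, c in word_freq.items() if c > cutoff]


# Made-up tokenized documents, for illustration only.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good"]]
kept = count_and_prune(docs, cutoff=1)
print(sorted(kept))
```

Note that the comparison is strict, so a word that occurs exactly `cutoff` times is dropped.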
--- a/python/paddle/v2/dataset/imikolov.py
+++ b/python/paddle/v2/dataset/imikolov.py
@@ -17,6 +17,7 @@ imikolov's simple dataset: http://www.fit.vutbr.cz/~imikolov/rnnlm/
 Complete comments.
 """
 import paddle.v2.dataset.common
+import collections
 import tarfile
 
 __all__ = ['train', 'test', 'build_dict']
@@ -26,15 +27,14 @@ MD5 = '30177ea32e27c525793142b6bf2c8e2d'
 
 
 def word_count(f, word_freq=None):
-    add = paddle.v2.dataset.common.dict_add
-    if word_freq == None:
-        word_freq = {}
+    if word_freq is None:
+        word_freq = collections.defaultdict(int)
 
     for l in f:
         for w in l.strip().split():
-            add(word_freq, w)
-        add(word_freq, '<s>')
-        add(word_freq, '<e>')
+            word_freq[w] += 1
+        word_freq['<s>'] += 1
+        word_freq['<e>'] += 1
 
     return word_freq
 
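The new `word_count` can be exercised in isolation to see how counts accumulate when an existing dictionary is passed back in. The sketch below assumes Python 3 and uses `io.StringIO` objects as stand-ins for the dataset files; the sample text and the chained call are illustrative assumptions, not taken from the commit. It also shows one behaviour of `defaultdict` worth keeping in mind: reading a missing key inserts it.

```python
import collections
import io


def word_count(f, word_freq=None):
    # `is None` tests identity; `== None` can be overridden by a custom __eq__.
    if word_freq is None:
        word_freq = collections.defaultdict(int)
    for line in f:
        for w in line.strip().split():
            word_freq[w] += 1
        # One sentence-start and one sentence-end marker per line.
        word_freq['<s>'] += 1
        word_freq['<e>'] += 1
    return word_freq


# Made-up file contents, for illustration only.
train = io.StringIO("a b\na c\n")
test = io.StringIO("b\n")

# Passing the first result as `word_freq` accumulates counts across both inputs.
freq = word_count(test, word_count(train))
print(dict(freq))

# Caveat: merely reading a missing key adds it with the value 0.
n_before = len(freq)
_ = freq["unseen"]
print(len(freq) - n_before)
```

If silent key insertion on lookup is undesirable, wrapping the result in `dict(...)` before handing it to callers avoids it.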