Crayon鑫 / Paddle (forked from PaddlePaddle / Paddle; in sync with the fork source)
Commit c3fd2c28 (unverified)
Authored Dec 26, 2017 by qingqing01
Committed Dec 26, 2017 by GitHub
Merge pull request #7002 from qingqing01/imdb_data
Speed data reader for IMDB dataset.
Parents: f8391545, eb8edeb2
Showing 1 changed file with 13 additions and 40 deletions

python/paddle/v2/dataset/imdb.py (+13, -40)
@@ -23,10 +23,9 @@ Besides, this module also provides API for building dictionary.
 import paddle.v2.dataset.common
 import collections
 import tarfile
-import Queue
 import re
 import string
-import threading
+import random
 
 __all__ = ['build_dict', 'train', 'test', 'convert']
 
@@ -74,47 +73,21 @@ def build_dict(pattern, cutoff):
     return word_idx
 
 
-def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
+def reader_creator(pos_pattern, neg_pattern, word_idx):
     UNK = word_idx['<unk>']
+    INS = []
 
-    qs = [Queue.Queue(maxsize=buffer_size), Queue.Queue(maxsize=buffer_size)]
-
-    def load(pattern, queue):
+    def load(pattern, out, label):
         for doc in tokenize(pattern):
-            queue.put(doc)
-        queue.put(None)
+            out.append(([word_idx.get(w, UNK) for w in doc], label))
+
+    load(pos_pattern, INS, 0)
+    load(neg_pattern, INS, 1)
+    random.shuffle(INS)
 
     def reader():
-        # Creates two threads that loads positive and negative samples
-        # into qs.
-        t0 = threading.Thread(
-            target=load, args=(
-                pos_pattern,
-                qs[0], ))
-        t0.daemon = True
-        t0.start()
-
-        t1 = threading.Thread(
-            target=load, args=(
-                neg_pattern,
-                qs[1], ))
-        t1.daemon = True
-        t1.start()
-
-        # Read alternatively from qs[0] and qs[1].
-        i = 0
-        doc = qs[i].get()
-        while doc != None:
-            yield [word_idx.get(w, UNK) for w in doc], i % 2
-            i += 1
-            doc = qs[i % 2].get()
-
-        # If any queue is empty, reads from the other queue.
-        i += 1
-        doc = qs[i % 2].get()
-        while doc != None:
-            yield [word_idx.get(w, UNK) for w in doc], i % 2
-            doc = qs[i % 2].get()
+        for doc, label in INS:
+            yield doc, label
 
     return reader
 
@@ -133,7 +106,7 @@ def train(word_idx):
     """
     return reader_creator(
         re.compile("aclImdb/train/pos/.*\.txt$"),
-        re.compile("aclImdb/train/neg/.*\.txt$"), word_idx, 1000)
+        re.compile("aclImdb/train/neg/.*\.txt$"), word_idx)
 
 
 def test(word_idx):
@@ -150,7 +123,7 @@ def test(word_idx):
     """
     return reader_creator(
         re.compile("aclImdb/test/pos/.*\.txt$"),
-        re.compile("aclImdb/test/neg/.*\.txt$"), word_idx, 1000)
+        re.compile("aclImdb/test/neg/.*\.txt$"), word_idx)
 
 
 def word_dict():
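
The reader change in this commit replaces two loader threads and bounded queues with a single in-memory list that is encoded once, shuffled once, and then replayed. The following is a minimal, self-contained sketch of that pattern, not the module's actual code. It assumes a stand-in `tokenize` that splits plain strings on whitespace (the real one reads files from the aclImdb tar archive by regex pattern), takes lists of strings instead of compiled patterns, and adds a hypothetical `seed` parameter so the shuffle is reproducible.

```python
import random


def tokenize(docs):
    """Stand-in for the module's tokenize(): yields one token list per document."""
    for text in docs:
        yield text.lower().split()


def reader_creator(pos_docs, neg_docs, word_idx, seed=None):
    """Encode every sample once, shuffle once, and return a replayable reader.

    `seed` is a hypothetical parameter added for reproducibility in this sketch.
    """
    unk = word_idx['<unk>']
    samples = []

    def load(docs, out, label):
        # Map each token to its id, falling back to the <unk> id.
        for doc in tokenize(docs):
            out.append(([word_idx.get(w, unk) for w in doc], label))

    load(pos_docs, samples, 0)  # positive reviews get label 0, as in the diff
    load(neg_docs, samples, 1)  # negative reviews get label 1
    random.Random(seed).shuffle(samples)

    def reader():
        # No I/O or tokenizing here: each pass only iterates the cached list.
        for ids, label in samples:
            yield ids, label

    return reader


word_idx = {'good': 0, 'bad': 1, 'movie': 2, '<unk>': 3}
reader = reader_creator(['Good movie', 'good plot'], ['Bad movie'], word_idx, seed=0)
first_pass = list(reader())
second_pass = list(reader())
```

Because the shuffle runs when the reader is created rather than inside `reader()`, every pass yields the samples in the same order. The trade-off is that the whole encoded dataset is held in memory, whereas the removed queue-based version buffered at most `buffer_size` documents per class.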