Forked from PaddlePaddle / Paddle
Commit aeeb77de, authored Nov 01, 2017 by typhoonzero
simple pipe reader for hdfs or other service
Parent: 2e74cf46
Showing 1 changed file with 98 additions and 0 deletions
python/paddle/v2/reader/decorator.py
@@ -323,3 +323,101 @@ def xmap_readers(mapper, reader, process_num, buffer_size, order=False):
                yield sample

    return xreader

def _buf2lines(buf, line_break="\n"):
    # FIXME: line_break should be automatically configured.
    lines = buf.split(line_break)
    return lines[:-1], lines[-1]

def pipe_reader(left_cmd,
                parser,
                bufsize=8192,
                file_type="plain",
                cut_lines=True,
                line_break="\n"):
    """
    pipe_reader reads data as a stream from a command, takes its
    stdout into a pipe buffer, and redirects it to the parser to
    parse, then yields data in your desired format.

    You can use a standard Linux command or call another program
    to read data from HDFS, Ceph, a URL, AWS S3, etc.:

    cmd = "hadoop fs -cat /path/to/some/file"
    cmd = "cat sample_file.tar.gz"
    cmd = "curl http://someurl"
    cmd = "python print_s3_bucket.py"

    A sample parser:

    def sample_parser(lines):
        # parse each line as one sample,
        # return a list of samples as a batch.
        ret = []
        for l in lines:
            ret.append(l.split(" ")[1:5])
        return ret

    :param left_cmd: command to execute to get stdout from.
    :type left_cmd: string
    :param parser: parser function to parse lines of data.
                   If cut_lines is True, parser will receive a list
                   of lines.
                   If cut_lines is False, parser will receive a
                   raw buffer each time.
                   parser should return a list of parsed values.
    :type parser: callable
    :param bufsize: the buffer size used for the stdout pipe.
    :type bufsize: int
    :param file_type: data type of the stream buffer, either
                      plain or gzip.
    :type file_type: string
    :param cut_lines: whether to pass lines instead of a raw buffer
                      to the parser.
    :type cut_lines: bool
    :param line_break: line break of the file, like \n or \r
    :type line_break: string
    :return: the reader generator.
    :rtype: callable
    """
    if not isinstance(left_cmd, str):
        raise TypeError("left_cmd must be a string")
    if not callable(parser):
        raise TypeError("parser must be a callable object")

    process = subprocess.Popen(
        left_cmd.split(" "), bufsize=bufsize, stdout=subprocess.PIPE)
    # TODO(typhoonzero): add a thread to read stderr

    # Initializing the decompress object once is better than
    # creating it inside the loop.
    dec = zlib.decompressobj(
        32 + zlib.MAX_WBITS)  # offset 32: auto-detect the zlib/gzip header

    def reader():
        remained = ""
        while True:
            buff = process.stdout.read(bufsize)
            if buff:
                if file_type == "gzip":
                    decomp_buff = dec.decompress(buff)
                elif file_type == "plain":
                    decomp_buff = buff
                else:
                    raise TypeError("file_type %s is not allowed" % file_type)

                if cut_lines:
                    lines, remained = _buf2lines(''.join(
                        [remained, decomp_buff]), line_break)
                    parsed_list = parser(lines)
                    for ret in parsed_list:
                        yield ret
                else:
                    for ret in parser(decomp_buff):
                        yield ret
            else:
                break

    return reader
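
The chunk-and-remainder technique used by `_buf2lines` and `reader` can be tried in isolation. The following is a minimal Python 3 sketch, not part of this commit: the names `split_complete_lines` and `stream_lines` and the child command are hypothetical, and it works on `bytes` because Python 3 pipes return bytes. It also yields the trailing fragment at end of stream, which is an assumption about the desired behavior; the committed `reader` leaves whatever is in `remained` unparsed when the loop breaks.

```python
import subprocess
import sys


def split_complete_lines(buf, line_break=b"\n"):
    """Return (complete_lines, trailing_fragment) for a bytes buffer."""
    parts = buf.split(line_break)
    return parts[:-1], parts[-1]


def stream_lines(cmd_args, bufsize=8192, line_break=b"\n"):
    """Yield complete lines from a command's stdout, one chunk at a time."""
    process = subprocess.Popen(cmd_args, stdout=subprocess.PIPE)
    remained = b""
    try:
        while True:
            chunk = process.stdout.read(bufsize)
            if not chunk:
                break
            # Prepend the fragment left over from the previous chunk.
            lines, remained = split_complete_lines(remained + chunk,
                                                   line_break)
            for line in lines:
                yield line
        if remained:
            # Assumption: a last line without a trailing break is still data.
            yield remained
    finally:
        process.stdout.close()
        process.wait()


# Hypothetical command: a child Python process that writes three lines,
# the last one without a trailing newline.
cmd = [
    sys.executable, "-c",
    "import sys; sys.stdout.buffer.write(b'a 1\\nb 2\\nc 3')"
]
# A deliberately tiny bufsize forces lines to span several reads.
samples = [line.split(b" ") for line in stream_lines(cmd, bufsize=4)]
```

Passing the command as an argument list rather than splitting a string on spaces avoids breaking arguments that themselves contain spaces, which is why the sketch departs from `left_cmd.split(" ")` here.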